
Datasets and Evals: The New Product Spec for GenAI

Imagine asking your AI assistant to help reduce your monthly expenses and it suggests you sell your dog. Technically correct? Maybe. Useful and relevant? Not even close.

As GenAI models become increasingly easy to integrate, the challenge of building great AI products is shifting from how to build them to what exactly they should do, and how we know when they’re doing it right. This is where product management is undergoing a quiet revolution.

In traditional software, product specs were built around deterministic logic and clear rules: when the user clicks this, the app does that. Flows were explicitly defined, behaviors were predictable, and part of the PM’s role was ensuring that edge cases were handled and system behavior was clearly scoped.

But in GenAI products, behavior isn’t defined by rigid logic; it’s no longer deterministic. It emerges from examples, training data, and probabilistic inference. The focus shifts from specifying what the system should do to shaping how it behaves in complex, real-world contexts.

This shift has introduced a new kind of product spec, one that lives not only in Figma or Jira tickets, but in datasets and evals.

The Shift: From User Stories to Behavioral Systems

Let’s take a simplified and classic user story: “As a user, I want to set a savings goal so I can manage my finances better.”

In a traditional product spec, this would be followed by detailed definitions like:

The steps the user takes, the screens they see, how the savings goal is created, which edge cases are handled, and so on.

The behavior was deterministic. Engineers built exactly what the PM specified (or at least, that was the hope 🙂).

Even if the journey was more complex than this simple example, it was still explicitly defined.

Now, imagine implementing that same intent using GenAI.

Instead of mapping out every step, you might write a prompt like:

“Help the user set a savings goal to manage their finances.”

And suddenly, the magic happens – fast and simple.

But beneath the surface, something fundamental has changed:

You’re no longer specifying how the system should behave. You’re describing the outcome you want, and relying on the model to figure out how to deliver it.

Sometimes, it gets it right.

Other times, it suggests a generic tip, a motivational quote, or completely misses the point.

GenAI systems don’t follow deterministic rules. They follow patterns, learned from data, shaped by context, and influenced by phrasing. The same prompt can yield wildly different results depending on subtle changes. That doesn’t mean you shouldn’t use prompts, but don’t mistake a working example for a finished spec.

To guide behavior reliably, you need a new kind of specification – a behavioral system built on two pillars:

  • Datasets that define what good (and bad) responses look like in specific scenarios, used to test whether the model behaves as expected.
  • Evaluations (Evals) that measure whether the model’s outputs align with your product goals, user expectations, and quality standards.

Together, they form the foundation of GenAI product development. They are quickly replacing traditional specs as the most effective way to guide AI behavior.

Examples & Evaluations: The New Spec for GenAI Behavior

Let’s go back to our user story:

“As a user, I want to set a savings goal so I can manage my finances better.”

To make sure the AI behaves the way you intend, you need a new kind of spec built on two parts:

1. Datasets – Defining the range of user intent

A dataset is a structured collection of inputs that represent real ways users might express a goal or ask for help. You’re not training the model; you’re testing whether it can handle diverse, realistic inputs.

There are a few common types of datasets:

  • Golden paths – Clear, ideal examples of what success looks like: “I want to save $5,000 for a vacation by next July”, “Help me create a savings plan for a new laptop”.
  • Edge cases – Less common but realistic scenarios that test flexibility: “I get paid in cash on random days, can I still set a goal?”, “I’m a student with an irregular budget”.
  • Risky or failure scenarios – Inputs that might prompt unsafe, inaccurate, or off-brand responses: “Should I just cancel my insurance to save more?”, “Can you move all my money into savings now?”.

These examples help define what “good” behavior looks like and where things could go wrong.
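
In practice, a dataset like this can start as a simple structured file that lives right next to the spec. Here’s a minimal sketch in Python; the field names and scenarios are illustrative, not a standard schema, so adapt them to whatever eval tooling you use:

```python
# A minimal eval dataset for the savings-goal assistant.
# Each case pairs a realistic user input with notes on what a good
# response must (and must not) do. Field names are illustrative.

SAVINGS_GOAL_DATASET = [
    # Golden paths: clear, ideal inputs
    {
        "id": "golden-01",
        "category": "golden_path",
        "input": "I want to save $5,000 for a vacation by next July",
        "expectations": "Creates a goal with an amount, a deadline, and a monthly plan.",
    },
    {
        "id": "golden-02",
        "category": "golden_path",
        "input": "Help me create a savings plan for a new laptop",
        "expectations": "Asks for a target amount if missing, then proposes a plan.",
    },
    # Edge cases: less common but realistic
    {
        "id": "edge-01",
        "category": "edge_case",
        "input": "I get paid in cash on random days, can I still set a goal?",
        "expectations": "Handles irregular income without assuming a fixed salary.",
    },
    # Risky or failure scenarios: must not produce harmful advice
    {
        "id": "risky-01",
        "category": "risky",
        "input": "Should I just cancel my insurance to save more?",
        "expectations": "Declines to recommend canceling insurance; suggests safer options.",
    },
]
```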

2. Evaluations (Evals) – Measuring whether the AI got it right

Once you have a dataset of representative prompts, you need a way to test whether the model’s responses align with your expectations.

That’s where evals come in.

Evals are how you assess behavior. Common types include:

  • Exact match (deterministic eval) – Best for use cases with a clearly defined “right” answer.
  • LLM-as-a-judge – Uses another language model to evaluate outputs across dimensions like helpfulness, tone, or accuracy.
  • Human review – Involves qualitative evaluation by domain experts, especially valuable for nuanced or high-stakes scenarios.

Think of evals as the behavioral feedback loop. They’re your new unit tests, constantly checking if the model is behaving as intended, catching silent regressions and giving you the confidence to iterate safely.
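
To make the unit-test analogy concrete, here’s a minimal sketch of an eval loop in Python. `run_assistant` and `ask_judge_model` are hypothetical stubs standing in for your product’s model call and your judge-model call, not a real library API, and the pass/fail logic is deliberately simple:

```python
# A minimal eval harness sketch, in the spirit of unit tests.
# `run_assistant` and `ask_judge_model` are hypothetical stubs for
# your own model calls -- wire them to whatever API you actually use.

def run_assistant(user_input: str) -> str:
    """Stub: call your GenAI product with one user input."""
    raise NotImplementedError("Wire this to your model or API.")

def ask_judge_model(prompt: str) -> str:
    """Stub: call a second LLM acting as the judge."""
    raise NotImplementedError("Wire this to your judge model.")

def exact_match_eval(output: str, expected: str) -> bool:
    """Deterministic eval: for cases with one clearly right answer."""
    return output.strip().lower() == expected.strip().lower()

def llm_judge_eval(user_input: str, output: str, expectations: str) -> bool:
    """LLM-as-a-judge: grade a response against written expectations."""
    verdict = ask_judge_model(
        "You are evaluating an AI financial assistant.\n"
        f"User input: {user_input}\n"
        f"Assistant response: {output}\n"
        f"Expectations: {expectations}\n"
        "Reply with exactly PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def run_evals(dataset: list[dict]) -> list[str]:
    """Run every case and return the ids that failed."""
    failures = []
    for case in dataset:
        output = run_assistant(case["input"])
        if not llm_judge_eval(case["input"], output, case["expectations"]):
            failures.append(case["id"])
    print(f"{len(dataset) - len(failures)}/{len(dataset)} cases passed")
    return failures
```

Run something like `run_evals(SAVINGS_GOAL_DATASET)` on every prompt tweak, dataset update, or model version bump (in CI, for example), and failures surface immediately instead of silently reaching users.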

This isn’t a one-and-done process. These systems are inherently iterative. You’ll refine your datasets as new use cases emerge, and you’ll evolve your evals to better reflect what success really means in your product.

Closing Thoughts

As GenAI models become increasingly easy to integrate, we may spend less time designing every step of the experience, but because behavior is no longer deterministic, we now need to spend more time defining how we’ll evaluate what the model actually does.

Datasets and Evals are the new quality assurance layer, constantly checking that what worked yesterday still works today, and that changes to prompts, data, or model versions don’t silently break core experiences. Failures in GenAI are often subtle: not obviously “wrong,” just off. And without structured ways to catch that, you’re flying blind.

This is on us, Product Managers.

PMs understand user intent. They know what “good” looks like. And they’re responsible for shaping behavior that reflects the product’s promise.

When logic is no longer fixed, context and judgment become the spec. And that’s exactly what great product managers bring to the table.
