I used to run evals on whatever dataset I had around. It was noisy, and I kept chasing false regressions that were just bad data. The fix was simple: build a small, versioned "golden dataset" from real traces and keep it stable.
A golden dataset is a regression safety net: a stable, representative set of real examples that you rerun after every prompt or model change. If you already collect traces in LangSmith, you can turn those runs into a high-signal dataset in a few steps.
Here's the workflow I actually use. It is not fancy: harvest traces, curate a subset, label it, add schema and tags, then run evals and track changes over time.
In 30 seconds
- Built a small, stable golden eval set from real traces to stop chasing noisy regressions.
- Workflow: harvest -> curate ~100 -> label/rubric -> enforce schema -> tags -> evals -> version updates.
- Repeatable prompt/model comparisons + earlier regression detection (before users complain).
Key takeaways
- Golden set = regression safety net, not a trace dump.
- Schema + tags turn replay into signal.
- Keep it small, stable, and updated deliberately.
Why golden datasets matter
Prompt tweaks and model swaps happen fast, and production behavior drifts faster. Without a stable eval set, you usually learn about regressions from frustrated users. A golden set helps by giving you:
- A repeatable benchmark you trust.
- A baseline for prompt and model comparisons.
- Early regression detection before users complain.
- A small set you can run frequently (daily or in CI).
The key is that the data is real. You are not guessing what users ask. You are measuring it.
Example: licensing assistant in production
Imagine a LangGraph assistant that helps sales reps quote and manage software licenses for Microsoft and Bitdefender, a setup I have worked on. Recommendations start from the product database, but selection and adaptation happen in the LLM/agent. Users ask for bundle recommendations, renewal terms, and region-specific compliance. A wrong SKU or term can kill a deal. These weren’t edge cases — they showed up often enough to matter.
In that setting, a golden dataset built from real traces helps catch regressions like:
- A 2-year renewal shortened to 1 year.
- A region-restricted SKU suggested outside its allowed market.
- A missing compliance clause in the generated quote summary.
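Concretely, each of those regressions maps to a field in the expected output that you can assert on later. A tiny illustration (the field names mirror the harvest pseudo-code further down; the values are made up):

# Illustration only: reviewed expected values for one golden example (made-up data)
example_expected = {
    "sku": "BD-GZ-EU-100",               # catches the wrong or region-restricted SKU
    "term_months": 24,                   # catches the 2-year renewal shortened to 1 year
    "region": "eu",
    "compliance_clauses": ["gdpr_dpa"],  # catches the missing clause in the quote summary
}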
What makes a dataset "golden"
A golden dataset is not just a random dump of traces. It has three properties:
- Representative
- It covers your top user intents.
- It includes failure-prone queries and edge cases.
- Labeled
- It has reference answers or scoring rubrics (exact SKUs, correct terms, required clauses).
- It is structured enough for automated evaluators.
- Stable and versioned
- You can rerun it across weeks and compare results.
- You know exactly when and why it changed.
If you only have one dataset in LangSmith, make that one your golden set.
The workflow I use
Step 1: Harvest candidate traces
Start by pulling recent traces from the flows that matter most (your main product paths). Use a fixed time window (for example, the last 7 days) so you are comparing the same traffic conditions over time. I usually stick to a normal week unless there was a big launch or outage. If you need a quick refresher on datasets and runs, the official LangSmith docs are a good reference. The example below is intentionally non-runnable pseudo-code that focuses on the flow; helpers like redact_pii, group_by, and sample_evenly are placeholders:
# Pseudo-code: harvest traces and create a golden dataset
runs = langsmith.list_runs(project_name="licensing-assistant", since_days=7)

# Stratified sampling keeps region/product biases in check
candidates = [
    r for r in runs
    if r.metadata.get("intent") in {"quote", "renewal"}
    and r.metadata.get("region") in {"us", "eu"}
]

# Pseudo: stratify so you don't overfit to one slice (intent x region)
buckets = group_by(candidates, key=lambda r: (r.metadata.get("intent"), r.metadata.get("region")))
candidates = sample_evenly(buckets, n_per_bucket=30)  # ~120 total

dataset = langsmith.create_dataset(
    name="golden-licensing-v1",
    schema={
        "input": "string",
        "expected": "object",
        "metadata": "object"
    }
)

for run in candidates:
    user_input = run.inputs.get("query") or run.inputs.get("input") or ""
    user_input = redact_pii(user_input)
    # Seed expected from outputs, then review/edit in LangSmith to make it correct.
    # NOTE: expected values must become ground truth (don't trust model output).
    dataset.add_example(
        input=user_input,
        expected={
            "sku": run.outputs.get("sku"),
            "term_months": run.outputs.get("term_months"),
            "region": run.outputs.get("region"),
            "compliance_clauses": run.outputs.get("compliance_clauses", []),
            "price_breakdown": run.outputs.get("price_breakdown", {})
        },
        metadata={
            "intent": run.metadata.get("intent"),
            "region": run.metadata.get("region"),
            "product_line": run.metadata.get("product_line"),
            "catalog_version": run.metadata.get("catalog_version"),
            "pricing_version": run.metadata.get("pricing_version"),
            "compliance_version": run.metadata.get("compliance_version")
        }
    )

My quick selection rules:
- Include common, boring queries. They are your baseline.
- Include "spiky" queries (long, multi-step, or adversarial).
- Remove any runs with clear input errors or missing data.
- Add negative cases intentionally (cross-region requests, expired SKUs, missing customer profiles).
- Redact and normalize PII before saving runs (customer names, tenant IDs, emails); a minimal redact_pii sketch follows below.
If you have multiple pipelines, create a candidate set per pipeline and then merge for the golden set.
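The redact_pii call in the harvest pseudo-code is a placeholder. Here is a minimal sketch of what it can look like; the regexes are assumptions (adjust the tenant ID pattern to your real format, and expect to need a CRM lookup or NER pass for customer names):

import re

# Assumed patterns: emails plus a hypothetical "tenant-xxxxxxxx" ID format.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
TENANT_ID = re.compile(r"\btenant-[a-z0-9]{8}\b")

def redact_pii(text: str) -> str:
    """Replace obvious identifiers before the example lands in the dataset."""
    text = EMAIL.sub("[email]", text)
    text = TENANT_ID.sub("[tenant-id]", text)
    return text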
Step 2: Curate for coverage
Curation is what turns data into a signal. A small, well-curated set beats a large, noisy one.
Practical checklist:
- Coverage by topic or intent (use tags).
- A balance of easy vs hard cases.
- Known historical failures (regressions you already saw).
Aim for 50-200 examples for your first golden set. I usually start with 100 because it is small enough to review by hand.
If you can, split your datasets into a "golden" (hand-verified) set and a larger "silver" set (auto-harvested). Run evals on the golden set and use silver to discover new patterns.
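Before locking the set, I like a quick count per slice so gaps are obvious. A minimal sketch in plain Python, assuming each curated example is a dict carrying the metadata fields from the harvest step:

from collections import Counter

def coverage_report(examples: list[dict]) -> Counter:
    """Count examples per (intent, difficulty) slice; thin slices mean weak coverage."""
    counts = Counter(
        (ex["metadata"].get("intent"), ex["metadata"].get("difficulty"))
        for ex in examples
    )
    for (intent, difficulty), n in counts.most_common():
        print(f"{intent or 'unknown':<12} {difficulty or 'unlabeled':<10} {n}")
    return counts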
Step 3: Add labels and grading criteria
Without labels, you are just replaying inputs. You need at least one of:
- Reference outputs (expected answers).
- Grading rubrics (criteria for an evaluator).
- Metadata needed for a downstream evaluator.
If full reference answers are too expensive, start with a rubric and an LLM-as-judge evaluator. The goal is to detect regressions, not to chase a perfect ground truth. For strict checks, prefer structured expected outputs (JSON targets) so you can validate fields like sku, term_months, and region directly instead of comparing free-form text.
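Here is what a field-level check can look like, as a minimal sketch. It is plain Python (wrap it for whatever eval runner you use), and the field names follow the harvest pseudo-code above:

def grade_quote(outputs: dict, expected: dict) -> dict:
    """Compare structured model output to the golden example, field by field."""
    checks = {
        "sku_match": outputs.get("sku") == expected.get("sku"),
        "term_match": outputs.get("term_months") == expected.get("term_months"),
        "region_match": outputs.get("region") == expected.get("region"),
        # Every required clause must be present; extra clauses are fine.
        "clauses_present": set(expected.get("compliance_clauses", [])).issubset(
            outputs.get("compliance_clauses", [])
        ),
    }
    return {
        "score": sum(checks.values()) / len(checks),
        "failed": [name for name, ok in checks.items() if not ok],
    }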
Step 4: Enforce a schema
Schema is how you prevent bad examples from sneaking into your golden set. A minimal schema typically includes:
- Input (the user query or input payload).
- Reference output (optional but ideal).
- Metadata (topic, difficulty, failure mode).
Even if you do nothing else, add a schema to your dataset. It cuts down on flaky evals caused by malformed examples.
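Even a lightweight check at insert time pays off. Here is a minimal sketch using pydantic (any validation library works); the model mirrors the input/expected/metadata structure from the harvest step, and in practice you would type expected as a nested model rather than a bare dict:

from pydantic import BaseModel, ValidationError

class GoldenExample(BaseModel):
    input: str
    expected: dict   # ideally a nested model: sku, term_months, region, compliance_clauses
    metadata: dict   # intent, region, difficulty, failure_mode

def validate_examples(raw_examples: list[dict]) -> tuple[list[GoldenExample], list[str]]:
    """Split candidates into valid examples and human-readable rejection reasons."""
    valid, errors = [], []
    for i, raw in enumerate(raw_examples):
        try:
            valid.append(GoldenExample(**raw))
        except ValidationError as exc:
            errors.append(f"example {i}: {exc.errors()[0]['msg']}")
    return valid, errors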
Step 5: Tag examples for slicing
Tags let you detect regressions that averages hide. You can slice results by:
- Topic or intent.
- Difficulty (easy, medium, hard).
- Failure mode (hallucination, retrieval miss, formatting).
Keep tags minimal at first: 3-5 tags are enough to start. I once over-tagged a dataset and never used half the tags again.
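Slicing itself is a few lines once results come back. A sketch, assuming each result is a (tags, score) pair from whatever runner you use:

from collections import defaultdict

def pass_rate_by_tag(results, threshold=0.8):
    """Per-tag pass rates, so a regression in one slice can't hide in the overall average."""
    by_tag = defaultdict(list)
    for tags, score in results:
        for tag in tags:
            by_tag[tag].append(score)
    return {
        tag: sum(s >= threshold for s in scores) / len(scores)
        for tag, scores in by_tag.items()
    }

# Example: pass_rate_by_tag([(["renewal", "hard"], 0.6), (["quote", "easy"], 1.0)])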
Step 6: Run evals and compare variants
Now you have a dataset you can trust. Use it for:
- Prompt A vs Prompt B comparisons.
- Model A vs Model B comparisons.
- Release vs previous release regressions.
Keep your experiment design consistent: same dataset version, same evaluator setup, and a fixed number of repetitions if you need variance control.
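With the LangSmith SDK, a comparison run can look roughly like this. It is a sketch, not a drop-in script: call_prompt_a and call_prompt_b stand in for your two variants, grade_quote is the field check from Step 3, and the exact import path may differ with your langsmith version:

from langsmith.evaluation import evaluate

def quote_fields(run, example):
    # Wraps the field-level check from Step 3 for the eval runner.
    result = grade_quote(run.outputs or {}, example.outputs or {})
    return {"key": "quote_fields", "score": result["score"]}

for name, target in [("prompt-a", call_prompt_a), ("prompt-b", call_prompt_b)]:
    evaluate(
        target,                      # your chain/agent as fn(inputs: dict) -> dict
        data="golden-licensing-v1",  # same dataset version for both runs
        evaluators=[quote_fields],
        experiment_prefix=name,      # keeps the two experiments side by side in the UI
    )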
Step 7: Maintain the golden set
Golden datasets are living artifacts. Plan small, deliberate updates:
- Add new user intents when they become important.
- Rotate in new edge cases every few weeks.
- Keep a core set that never changes.
When you update the dataset, version it and record why it changed. I keep it simple with names like golden-v1, golden-v2, plus a tiny changelog.
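One way I keep versions explicit is to freeze the old dataset and copy the core examples forward into a new one, sketched here with the LangSmith client (dataset names and the description text are illustrative):

from langsmith import Client

client = Client()

# Freeze golden-licensing-v1; start v2 by copying the core examples forward.
v2 = client.create_dataset(
    dataset_name="golden-licensing-v2",
    description="v2: adds expired-SKU renewal cases; core v1 examples unchanged",
)
for ex in client.list_examples(dataset_name="golden-licensing-v1"):
    client.create_example(
        inputs=ex.inputs,
        outputs=ex.outputs,
        metadata=ex.metadata,
        dataset_id=v2.id,
    )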
If you want a simple diagram, use this flow: Prod traces -> filter and scrub -> label and tag -> golden dataset -> eval runs -> dashboard
Example workflow (pseudo steps)
- Export traces for a 7-day window.
- Filter by top intents and known failure tags.
- Manually review and select 100 examples.
- Add reference outputs or grading rubrics.
- Create or update a LangSmith dataset with schema and tags.
- Run evals for prompt or model changes.
- Track metrics over time and log regressions.
Suggested metrics
Pick metrics that match your product. I try to keep it to two or three:
- Correctness or rubric score.
- Retrieval relevance score (if RAG).
- Latency and cost (for operational changes).
Avoid overfitting to a single metric. The golden set is there to catch breakage, not to chase a perfect score.
Proof example (before and after)
In the licensing assistant, the first golden set showed 12% of examples failing schema validation due to missing region metadata, among other issues like term and SKU mismatches. After enforcing schema and adding a default region mapping, invalid rows dropped to 2% and eval reruns stopped flaking.
Common mistakes to avoid
- Using only "easy" examples.
- Skipping schema validation.
- Never refreshing the dataset.
- Changing the dataset while comparing experiments.
- Storing raw PII without redaction or normalization.
Call to action
If you want to start this week, here is a simple checklist:
- Pick 100 traces.
- Tag them with 3-5 categories.
- Add a schema and one evaluator.
- Rerun after your next prompt change.
Your future self (and your users) will thank you.
Companion repo coming soon.