
Building Golden Evaluation Datasets from Production Traces

I used to run evals on whatever dataset I had around. It was noisy, and I kept chasing false regressions that were just bad data. The fix was simple: build a small, versioned "golden dataset" from real traces and keep it stable.

A golden dataset is a regression safety net: a stable, representative set of real examples that you rerun after every prompt or model change. If you already collect traces in LangSmith, you can turn those runs into a high-signal dataset in a few steps.

Here's the workflow I actually use. It is not fancy: harvest traces, curate a subset, label it, add schema and tags, then run evals and track changes over time.

In 30 seconds

  • Built a small, stable golden eval set from real traces to stop chasing noisy regressions.
  • Workflow: harvest -> curate ~100 -> label/rubric -> enforce schema -> tags -> evals -> version updates.
  • Repeatable prompt/model comparisons + earlier regression detection (before users complain).

Key takeaways

  • Golden set = regression safety net, not a trace dump.
  • Schema + tags turn replay into signal.
  • Keep it small, stable, and updated deliberately.

Why golden datasets matter

Prompt tweaks and model swaps happen fast, and production behavior drifts faster. Without a stable eval set, you usually learn about regressions from frustrated users. A golden set helps by giving you:

  • A repeatable benchmark you trust.
  • A baseline for prompt and model comparisons.
  • Early regression detection before users complain.
  • A small set you can run frequently (daily or in CI).

The key is that the data is real. You are not guessing what users ask. You are measuring it.

Example: licensing assistant in production

Imagine a LangGraph assistant that helps sales reps quote and manage software licenses for Microsoft and Bitdefender, a setup I have worked on. Recommendations start from the product database, but selection and adaptation happen in the LLM/agent. Users ask for bundle recommendations, renewal terms, and region-specific compliance. A wrong SKU or term can kill a deal. These weren’t edge cases — they showed up often enough to matter.

In that setting, a golden dataset built from real traces helps catch regressions like:

  • A 2-year renewal shortened to 1 year.
  • A region-restricted SKU suggested outside its allowed market.
  • A missing compliance clause in the generated quote summary.

What makes a dataset "golden"

A golden dataset is not just a random dump of traces. It has three properties:

  1. Representative
  • It covers your top user intents.
  • It includes failure-prone queries and edge cases.
  2. Labeled
  • It has reference answers or scoring rubrics (exact SKUs, correct terms, required clauses).
  • It is structured enough for automated evaluators.
  3. Stable and versioned
  • You can rerun it across weeks and compare results.
  • You know exactly when and why it changed.

If you only have one dataset in LangSmith, make that one your golden set.

The workflow I use

Step 1: Harvest candidate traces

Start by pulling recent traces from the flows that matter most (your main product paths). Use a fixed time window (for example, the last 7 days) so you are comparing the same traffic conditions over time. I usually stick to a normal week unless there was a big launch or outage. If you need a quick refresher on datasets and runs, the official LangSmith docs are a good reference. The example below is intentionally non-runnable pseudo-code and focuses on the flow:

# Pseudo-code: harvest traces and create a golden dataset
runs = langsmith.list_runs(project_name="licensing-assistant", since_days=7)

# Stratified sampling keeps region/product biases in check
candidates = [
    r for r in runs
    if r.metadata.get("intent") in {"quote", "renewal"}
    and r.metadata.get("region") in {"us", "eu"}
]
# Pseudo: stratify so you don't overfit to one slice (intent x region)
buckets = group_by(candidates, key=lambda r: (r.metadata.get("intent"), r.metadata.get("region")))
candidates = sample_evenly(buckets, n_per_bucket=30)  # ~120 total

dataset = langsmith.create_dataset(
    name="golden-licensing-v1",
    schema={
        "input": "string",
        "expected": "object",
        "metadata": "object"
    }
)

for run in candidates:
    user_input = run.inputs.get("query") or run.inputs.get("input") or ""
    user_input = redact_pii(user_input)
    # Seed expected from outputs, then review/edit in LangSmith to make it correct.
    # NOTE: expected values must become ground truth (don’t trust model output).
    dataset.add_example(
        input=user_input,
        expected={
            "sku": run.outputs.get("sku"),
            "term_months": run.outputs.get("term_months"),
            "region": run.outputs.get("region"),
            "compliance_clauses": run.outputs.get("compliance_clauses", []),
            "price_breakdown": run.outputs.get("price_breakdown", {})
        },
        metadata={
            "intent": run.metadata.get("intent"),
            "region": run.metadata.get("region"),
            "product_line": run.metadata.get("product_line"),
            "catalog_version": run.metadata.get("catalog_version"),
            "pricing_version": run.metadata.get("pricing_version"),
            "compliance_version": run.metadata.get("compliance_version")
        }
    )
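
The helpers above (group_by, sample_evenly, redact_pii) are just placeholders. If you want something concrete to start from, here is a minimal sketch of what they could look like; the email regex is illustrative, and real redaction should lean on a proper PII tool:

import random
import re
from collections import defaultdict

def group_by(items, key):
    """Group items into buckets keyed by the given function."""
    buckets = defaultdict(list)
    for item in items:
        buckets[key(item)].append(item)
    return buckets

def sample_evenly(buckets, n_per_bucket):
    """Take up to n_per_bucket items from each bucket so no slice dominates."""
    sampled = []
    for bucket in buckets.values():
        sampled.extend(random.sample(bucket, min(n_per_bucket, len(bucket))))
    return sampled

# Illustrative only: real redaction should also cover names, tenant IDs, and
# anything else your traces contain, ideally with a dedicated PII library.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_pii(text):
    return EMAIL_RE.sub("<email>", text)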

My quick selection rules:

  • Include common, boring queries. They are your baseline.
  • Include "spiky" queries (long, multi-step, or adversarial).
  • Remove any runs with clear input errors or missing data.
  • Add negative cases intentionally (cross-region requests, expired SKUs, missing customer profiles).
  • Redact and normalize PII before saving runs (customer names, tenant IDs, emails).

If you have multiple pipelines, create a candidate set per pipeline and then merge for the golden set.

Step 2: Curate for coverage

Curation is what turns data into a signal. A small, well-curated set beats a large, noisy one.

Practical checklist:

  • Coverage by topic or intent (use tags).
  • A balance of easy vs hard cases.
  • Known historical failures (regressions you already saw).

Aim for 50-200 examples for your first golden set. I usually start with 100 because it is small enough to review by hand.

If you can, split your datasets into a "golden" (hand-verified) set and a larger "silver" set (auto-harvested). Run evals on the golden set and use silver to discover new patterns.
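
Before freezing the set, I like a quick coverage check: count examples per slice and flag anything thin. A minimal sketch, assuming each example carries the intent metadata from Step 1 plus a difficulty tag you add during review:

from collections import Counter

def coverage_report(examples, min_per_slice=5):
    """Count examples per (intent, difficulty) slice and flag thin slices."""
    counts = Counter(
        (ex["metadata"].get("intent"), ex["metadata"].get("difficulty", "unknown"))
        for ex in examples
    )
    for slice_key, count in sorted(counts.items()):
        flag = "  <-- thin, add more" if count < min_per_slice else ""
        print(f"{slice_key}: {count}{flag}")
    return counts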

Step 3: Add labels and grading criteria

Without labels, you are just replaying inputs. You need at least one of:

  • Reference outputs (expected answers).
  • Grading rubrics (criteria for an evaluator).
  • Metadata needed for a downstream evaluator.

If full reference answers are too expensive, start with a rubric and an LLM-as-judge evaluator. The goal is to detect regressions, not to chase a perfect ground truth. For strict checks, prefer structured expected outputs (JSON targets) so you can validate fields like sku, term_months, and region directly instead of comparing free-form text.
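
For those strict field checks, the evaluator can be very small. A minimal sketch that assumes you have already parsed the model output into a dict with the same keys as the expected object from Step 1:

REQUIRED_FIELDS = ["sku", "term_months", "region"]

def field_match_evaluator(expected: dict, actual: dict) -> dict:
    """Compare a few structured fields instead of free-form text."""
    mismatches = {
        field: {"expected": expected.get(field), "actual": actual.get(field)}
        for field in REQUIRED_FIELDS
        if expected.get(field) != actual.get(field)
    }
    return {
        "score": 1.0 if not mismatches else 0.0,
        "mismatches": mismatches,
    }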

Step 4: Enforce a schema

Schema is how you prevent bad examples from sneaking into your golden set. A minimal schema typically includes:

  • Input (the user query or input payload).
  • Reference output (optional but ideal).
  • Metadata (topic, difficulty, failure mode).

Even if you do nothing else, add a schema to your dataset. It cuts down on flaky evals caused by malformed examples.
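
One way to enforce this before examples ever reach the dataset is to validate them in code. A minimal sketch with Pydantic; the model name and fields mirror the three-part schema above and are my own naming, not a LangSmith requirement:

from pydantic import BaseModel, ValidationError

class GoldenExample(BaseModel):
    """Minimal example schema: input, expected output, metadata."""
    input: str
    expected: dict
    metadata: dict

def validate_examples(raw_examples):
    """Split candidate examples into valid and invalid before uploading."""
    valid, invalid = [], []
    for raw in raw_examples:
        try:
            valid.append(GoldenExample(**raw))
        except ValidationError as err:
            invalid.append((raw, err))
    return valid, invalid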

Step 5: Tag examples for slicing

Tags let you detect regressions that averages hide. You can slice results by:

  • Topic or intent.
  • Difficulty (easy, medium, hard).
  • Failure mode (hallucination, retrieval miss, formatting).

Keep tags minimal at first: 3-5 tags are enough to start. I once over-tagged a dataset and never used half the tags again.
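
Once tags live in example metadata, slicing results is a small aggregation. A minimal sketch, assuming each eval result records the example's tags and a pass/fail outcome (the exact result shape depends on your harness):

from collections import defaultdict

def pass_rate_by_tag(results):
    """results: list of dicts like {"tags": ["renewal", "hard"], "passed": True}."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for result in results:
        for tag in result["tags"]:
            totals[tag][1] += 1
            if result["passed"]:
                totals[tag][0] += 1
    return {tag: passed / total for tag, (passed, total) in totals.items()}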

Step 6: Run evals and compare variants

Now you have a dataset you can trust. Use it for:

  • Prompt A vs Prompt B comparisons.
  • Model A vs Model B comparisons.
  • Release vs previous release regressions.

Keep your experiment design consistent: same dataset version, same evaluator setup, and a fixed number of repetitions if you need variance control.
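
Here is a minimal sketch of what that comparison can look like with the langsmith SDK's evaluate helper. The two target functions are placeholders for your prompt variants, the dataset name matches the harvest example, and you should check the SDK docs for the exact import and signature in your version:

from langsmith import evaluate  # assumption: evaluate() from the langsmith SDK

def exact_sku_match(run, example):
    """Custom evaluator: did the run produce the expected SKU?"""
    expected = (example.outputs or {}).get("sku")
    actual = (run.outputs or {}).get("sku")
    return {"key": "sku_match", "score": float(expected == actual)}

def run_with_prompt_a(inputs: dict) -> dict:
    # Placeholder target: call your chain/agent with prompt variant A.
    raise NotImplementedError

def run_with_prompt_b(inputs: dict) -> dict:
    # Placeholder target: call your chain/agent with prompt variant B.
    raise NotImplementedError

for name, target in [("prompt-a", run_with_prompt_a), ("prompt-b", run_with_prompt_b)]:
    evaluate(
        target,
        data="golden-licensing-v1",   # same dataset version for both variants
        evaluators=[exact_sku_match],
        experiment_prefix=name,
    )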

Step 7: Maintain the golden set

Golden datasets are living artifacts. Plan small, deliberate updates:

  • Add new user intents when they become important.
  • Rotate in new edge cases every few weeks.
  • Keep a core set that never changes.

When you update the dataset, version it and record why it changed. I keep it simple with names like golden-v1, golden-v2, plus a tiny changelog.
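
Nothing fancy on my side: a JSON lines changelog next to the eval code works. A minimal sketch of the entry format (the file and helper are my convention, not a LangSmith feature):

import json
from datetime import date

def log_dataset_change(changelog_path, version, reason, added=0, removed=0):
    """Append one changelog entry per dataset version bump."""
    entry = {
        "version": version,            # e.g. "golden-licensing-v2"
        "date": date.today().isoformat(),
        "reason": reason,              # why the dataset changed
        "added": added,
        "removed": removed,
    }
    with open(changelog_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Usage
log_dataset_change(
    "golden_changelog.jsonl",
    version="golden-licensing-v2",
    reason="Added cross-region negative cases",
    added=10,
)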

If you want a simple diagram, use this flow: Prod traces -> filter and scrub -> label and tag -> golden dataset -> eval runs -> dashboard

Example workflow (pseudo steps)

  1. Export traces for a 7-day window.
  2. Filter by top intents and known failure tags.
  3. Manually review and select 100 examples.
  4. Add reference outputs or grading rubrics.
  5. Create or update a LangSmith dataset with schema and tags.
  6. Run evals for prompt or model changes.
  7. Track metrics over time and log regressions.

Suggested metrics

Pick metrics that match your product. I try to keep it to two or three:

  • Correctness or rubric score.
  • Retrieval relevance score (if RAG).
  • Latency and cost (for operational changes).

Avoid overfitting to a single metric. The golden set is there to catch breakage, not to chase a perfect score.

Proof example (before and after)

In the licensing assistant, the first golden set showed 12% of examples failing schema validation due to missing region metadata, among other issues like term and SKU mismatches. After enforcing schema and adding a default region mapping, invalid rows dropped to 2% and eval reruns stopped flaking.

Common mistakes to avoid

  • Using only "easy" examples.
  • Skipping schema validation.
  • Never refreshing the dataset.
  • Changing the dataset while comparing experiments.
  • Storing raw PII without redaction or normalization.

Call to action

If you want to start this week, here is a simple checklist:

  • Pick 100 traces.
  • Tag them with 3-5 categories.
  • Add a schema and one evaluator.
  • Rerun after your next prompt change.

Your future self (and your users) will thank you.

Companion repo coming soon.



Written by Florin
Tech snippets for everybody