I used to run evals on whatever dataset I had around. It was noisy, and I kept chasing false regressions that were just bad data. The fix was simple: build a small, versioned "golden dataset" from real traces and keep it stable.
A golden dataset is a regression safety net: a stable, representative set of real examples that you rerun after every prompt or model change. If you already collect traces in LangSmith, you can turn those runs into a high-signal dataset in a few steps.
Here's the workflow I actually use. It is not fancy: harvest traces, curate a subset, label it, add schema and tags, then run evals and track changes over time.
In 30 seconds
- Built a small, stable golden eval set from real traces to stop chasing noisy regressions.
- Workflow: harvest -> curate ~100 -> label/rubric -> enforce schema -> tags -> evals -> version updates.
- Repeatable prompt/model comparisons + earlier regression detection (before users complain).
Key takeaways
- Golden set = regression safety net, not a trace dump.
- Schema + tags turn replay into signal.
- Keep it small, stable, and updated deliberately.
Why golden datasets matter
Prompt tweaks and model swaps happen fast, and production behavior drifts faster. Without a stable eval set, you usually learn about regressions from frustrated users. A golden set helps by giving you:
- A repeatable benchmark you trust.
- A baseline for prompt and model comparisons.
- Early regression detection before users complain.
- A small set you can run frequently (daily or in CI).
The key is that the data is real. You are not guessing what users ask. You are measuring it.
Example: licensing assistant in production
Imagine a LangGraph assistant that helps sales reps quote and manage software licenses for Microsoft and Bitdefender, a setup I have worked on. Recommendations start from the product database, but selection and adaptation happen in the LLM/agent. Users ask for bundle recommendations, renewal terms, and region-specific compliance. A wrong SKU or term can kill a deal. These weren’t edge cases — they showed up often enough to matter.
In that setting, a golden dataset built from real traces helps catch regressions like:
- A 2-year renewal shortened to 1 year.
- A region-restricted SKU suggested outside its allowed market.
- A missing compliance clause in the generated quote summary.
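Concretely, each of those regressions maps to a field in the expected output that you can assert on later. A tiny illustration (the field names mirror the harvest pseudo-code further down; the values are made up):

# Illustration only: reviewed expected values for one golden example (made-up data)
example_expected = {
    "sku": "BD-GZ-EU-100",               # catches the wrong or region-restricted SKU
    "term_months": 24,                   # catches the 2-year renewal shortened to 1 year
    "region": "eu",
    "compliance_clauses": ["gdpr_dpa"],  # catches the missing clause in the quote summary
}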
What makes a dataset "golden"
A golden dataset is not just a random dump of traces. It has three properties:
- Representative
- It covers your top user intents.
- It includes failure-prone queries and edge cases.
- Labeled
- It has reference answers or scoring rubrics (exact SKUs, correct terms, required clauses).
- It is structured enough for automated evaluators.
- Stable and versioned
- You can rerun it across weeks and compare results.
- You know exactly when and why it changed.
If you only have one dataset in LangSmith, make that one your golden set.
The workflow I use
Step 1: Harvest candidate traces
Start by pulling recent traces from the flows that matter most (your main product paths). Use a fixed time window (for example, the last 7 days) so you are comparing the same traffic conditions over time. I usually stick to a normal week unless there was a big launch or outage. If you need a quick refresher on datasets and runs, the official LangSmith docs are a good reference. The example below is intentionally non-runnable pseudo-code that focuses on the flow; helpers like redact_pii, group_by, and sample_evenly are placeholders:
# Pseudo-code: harvest traces and create a golden dataset
runs = langsmith.list_runs(project_name="licensing-assistant", since_days=7)

# Stratified sampling keeps region/product biases in check
candidates = [
    r for r in runs
    if r.metadata.get("intent") in {"quote", "renewal"}
    and r.metadata.get("region") in {"us", "eu"}
]

# Pseudo: stratify so you don't overfit to one slice (intent x region)
buckets = group_by(candidates, key=lambda r: (r.metadata.get("intent"), r.metadata.get("region")))
candidates = sample_evenly(buckets, n_per_bucket=30)  # ~120 total

dataset = langsmith.create_dataset(
    name="golden-licensing-v1",
    schema={
        "input": "string",
        "expected": "object",
        "metadata": "object"
    }
)

for run in candidates:
    user_input = run.inputs.get("query") or run.inputs.get("input") or ""
    user_input = redact_pii(user_input)
    # Seed expected from outputs, then review/edit in LangSmith to make it correct.
    # NOTE: expected values must become ground truth (don't trust model output).
    dataset.add_example(
        input=user_input,
        expected={
            "sku": run.outputs.get("sku"),
            "term_months": run.outputs.get("term_months"),
            "region": run.outputs.get("region"),
            "compliance_clauses": run.outputs.get("compliance_clauses", []),
            "price_breakdown": run.outputs.get("price_breakdown", {})
        },
        metadata={
            "intent": run.metadata.get("intent"),
            "region": run.metadata.get("region"),
            "product_line": run.metadata.get("product_line"),
            "catalog_version": run.metadata.get("catalog_version"),
            "pricing_version": run.metadata.get("pricing_version"),
            "compliance_version": run.metadata.get("compliance_version")
        }
    )

My quick selection rules:
- Include common, boring queries. They are your baseline.
- Include "spiky" queries (long, multi-step, or adversarial).
- Remove any runs with clear input errors or missing data.
- Add negative cases intentionally (cross-region requests, expired SKUs, missing customer profiles).
- Redact and normalize PII before saving runs (customer names, tenant IDs, emails); a minimal redact_pii sketch follows below.
If you have multiple pipelines, create a candidate set per pipeline and then merge for the golden set.
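The redact_pii call in the harvest pseudo-code is a placeholder. Here is a minimal sketch of what it can look like; the regexes are assumptions (adjust the tenant ID pattern to your real format, and expect to need a CRM lookup or NER pass for customer names):

import re

# Assumed patterns: emails plus a hypothetical "tenant-xxxxxxxx" ID format.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
TENANT_ID = re.compile(r"\btenant-[a-z0-9]{8}\b")

def redact_pii(text: str) -> str:
    """Replace obvious identifiers before the example lands in the dataset."""
    text = EMAIL.sub("[email]", text)
    text = TENANT_ID.sub("[tenant-id]", text)
    return text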
Step 2: Curate for coverage
Curation is what turns data into a signal. A small, well-curated set beats a large, noisy one.
Practical checklist:
- Coverage by topic or intent (use tags).
- A balance of easy vs hard cases.
- Known historical failures (regressions you already saw).
Aim for 50-200 examples for your first golden set. I usually start with 100 because it is small enough to review by hand.
If you can, split your datasets into a "golden" (hand-verified) set and a larger "silver" set (auto-harvested). Run evals on the golden set and use silver to discover new patterns.
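Before locking the set, I like a quick count per slice so gaps are obvious. A minimal sketch in plain Python, assuming each curated example is a dict carrying the metadata fields from the harvest step:

from collections import Counter

def coverage_report(examples: list[dict]) -> Counter:
    """Count examples per (intent, difficulty) slice; thin slices mean weak coverage."""
    counts = Counter(
        (ex["metadata"].get("intent"), ex["metadata"].get("difficulty"))
        for ex in examples
    )
    for (intent, difficulty), n in counts.most_common():
        print(f"{intent or 'unknown':<12} {difficulty or 'unlabeled':<10} {n}")
    return counts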
Step 3: Add labels and grading criteria
Without labels, you are just replaying inputs. You need at least one of:
- Reference outputs (expected answers).
- Grading rubrics (criteria for an evaluator).
- Metadata needed for a downstream evaluator.
If full reference answers are too expensive, start with a rubric and an LLM-as-judge evaluator. The goal is to detect regressions, not to chase a perfect ground truth. For strict checks, prefer structured expected outputs (JSON targets) so you can validate fields like sku, term_months, and region directly instead of comparing free-form text.
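Here is what a field-level check can look like, as a minimal sketch. It is plain Python (wrap it for whatever eval runner you use), and the field names follow the harvest pseudo-code above:

def grade_quote(outputs: dict, expected: dict) -> dict:
    """Compare structured model output to the golden example, field by field."""
    checks = {
        "sku_match": outputs.get("sku") == expected.get("sku"),
        "term_match": outputs.get("term_months") == expected.get("term_months"),
        "region_match": outputs.get("region") == expected.get("region"),
        # Every required clause must be present; extra clauses are fine.
        "clauses_present": set(expected.get("compliance_clauses", [])).issubset(
            outputs.get("compliance_clauses", [])
        ),
    }
    return {
        "score": sum(checks.values()) / len(checks),
        "failed": [name for name, ok in checks.items() if not ok],
    }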
Step 4: Enforce a schema
Schema is how you prevent bad examples from sneaking into your golden set. A minimal schema typically includes:
- Input (the user query or input payload).
- Reference output (optional but ideal).
- Metadata (topic, difficulty, failure mode).
Even if you do nothing else, add a schema to your dataset. It cuts down on flaky evals caused by malformed examples.
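Even a lightweight check at insert time pays off. Here is a minimal sketch using pydantic (any validation library works); the model mirrors the input/expected/metadata structure from the harvest step, and in practice you would type expected as a nested model rather than a bare dict:

from pydantic import BaseModel, ValidationError

class GoldenExample(BaseModel):
    input: str
    expected: dict   # ideally a nested model: sku, term_months, region, compliance_clauses
    metadata: dict   # intent, region, difficulty, failure_mode

def validate_examples(raw_examples: list[dict]) -> tuple[list[GoldenExample], list[str]]:
    """Split candidates into valid examples and human-readable rejection reasons."""
    valid, errors = [], []
    for i, raw in enumerate(raw_examples):
        try:
            valid.append(GoldenExample(**raw))
        except ValidationError as exc:
            errors.append(f"example {i}: {exc.errors()[0]['msg']}")
    return valid, errors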
Step 5: Tag examples for slicing
Tags let you detect regressions that averages hide. You can slice results by:
- Topic or intent.
- Difficulty (easy, medium, hard).
- Failure mode (hallucination, retrieval miss, formatting).
Keep tags minimal at first: 3-5 tags are enough to start. I once over-tagged a dataset and never used half the tags again.
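Slicing itself is a few lines once results come back. A sketch, assuming each result is a (tags, score) pair from whatever runner you use:

from collections import defaultdict

def pass_rate_by_tag(results, threshold=0.8):
    """Per-tag pass rates, so a regression in one slice can't hide in the overall average."""
    by_tag = defaultdict(list)
    for tags, score in results:
        for tag in tags:
            by_tag[tag].append(score)
    return {
        tag: sum(s >= threshold for s in scores) / len(scores)
        for tag, scores in by_tag.items()
    }

# Example: pass_rate_by_tag([(["renewal", "hard"], 0.6), (["quote", "easy"], 1.0)])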
Step 6: Run evals and compare variants
Now you have a dataset you can trust. Use it for:
- Prompt A vs Prompt B comparisons.
- Model A vs Model B comparisons.
- Release vs previous release regressions.
Keep your experiment design consistent: same dataset version, same evaluator setup, and a fixed number of repetitions if you need variance control.
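With the LangSmith SDK, a comparison run can look roughly like this. It is a sketch, not a drop-in script: call_prompt_a and call_prompt_b stand in for your two variants, grade_quote is the field check from Step 3, and the exact import path may differ with your langsmith version:

from langsmith.evaluation import evaluate

def quote_fields(run, example):
    # Wraps the field-level check from Step 3 for the eval runner.
    result = grade_quote(run.outputs or {}, example.outputs or {})
    return {"key": "quote_fields", "score": result["score"]}

for name, target in [("prompt-a", call_prompt_a), ("prompt-b", call_prompt_b)]:
    evaluate(
        target,                      # your chain/agent as fn(inputs: dict) -> dict
        data="golden-licensing-v1",  # same dataset version for both runs
        evaluators=[quote_fields],
        experiment_prefix=name,      # keeps the two experiments side by side in the UI
    )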
Step 7: Maintain the golden set
Golden datasets are living artifacts. Plan small, deliberate updates:
- Add new user intents when they become important.
- Rotate in new edge cases every few weeks.
- Keep a core set that never changes.
When you update the dataset, version it and record why it changed. I keep it simple with names like golden-v1, golden-v2, plus a tiny changelog.
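One way I keep versions explicit is to freeze the old dataset and copy the core examples forward into a new one, sketched here with the LangSmith client (dataset names and the description text are illustrative):

from langsmith import Client

client = Client()

# Freeze golden-licensing-v1; start v2 by copying the core examples forward.
v2 = client.create_dataset(
    dataset_name="golden-licensing-v2",
    description="v2: adds expired-SKU renewal cases; core v1 examples unchanged",
)
for ex in client.list_examples(dataset_name="golden-licensing-v1"):
    client.create_example(
        inputs=ex.inputs,
        outputs=ex.outputs,
        metadata=ex.metadata,
        dataset_id=v2.id,
    )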
If you want a simple diagram, use this flow: Prod traces -> filter and scrub -> label and tag -> golden dataset -> eval runs -> dashboard
Example workflow (pseudo steps)
- Export traces for a 7-day window.
- Filter by top intents and known failure tags.
- Manually review and select 100 examples.
- Add reference outputs or grading rubrics.
- Create or update a LangSmith dataset with schema and tags.
- Run evals for prompt or model changes.
- Track metrics over time and log regressions.
Suggested metrics
Pick metrics that match your product. I try to keep it to two or three:
- Correctness or rubric score.
- Retrieval relevance score (if RAG).
- Latency and cost (for operational changes).
Avoid overfitting to a single metric. The golden set is there to catch breakage, not to chase a perfect score.
Proof example (before and after)
In the licensing assistant, the first golden set showed 12% of examples failing schema validation due to missing region metadata, among other issues like term and SKU mismatches. After enforcing schema and adding a default region mapping, invalid rows dropped to 2% and eval reruns stopped flaking.
Common mistakes to avoid
- Using only "easy" examples.
- Skipping schema validation.
- Never refreshing the dataset.
- Changing the dataset while comparing experiments.
- Storing raw PII without redaction or normalization.
Call to action
If you want to start this week, here is a simple checklist:
- Pick 100 traces.
- Tag them with 3-5 categories.
- Add a schema and one evaluator.
- Rerun after your next prompt change.
Your future self (and your users) will thank you.
Companion repo coming soon.