Averages hide real problems. I learned that the hard way: a bundle recommender looked stable overall, while a single region-specific slice was failing badly. Only after tracing individual failures to root cause did I see that our aggregate metrics were hiding where users were actually failing; tags turned that hidden failure into a visible signal.
This post shows the tagging approach I use in the offer-bundling project: small taxonomy, consistent naming, and slices that map to real user risk. This pattern shows up in any evaluation system where aggregate scores hide localized failures.
In 30 seconds
- Tag examples and runs so you can slice evals by topic and risk.
- Keep the taxonomy small: 3 to 5 tags are enough to start.
- Compare model or prompt variants on the same tagged slices.
Key takeaways
- Tags turn averages into actionable insights.
- Consistent naming saves hours later.
- Maintain a core tagged set for fast regression checks.
Why tags matter
If you only look at overall scores, you miss where users actually struggle. Tags let you answer:
- Does the EU region fail more often?
- Are multi-step queries worse than single-step?
- Do compliance clauses drop under certain prompts?
A simple tagging taxonomy
Start with three dimensions:
- topic: quote, renewal, bundle
- difficulty: easy, medium, hard
- failure-mode: missing-clause, wrong-sku, over-budget
Keep it small. You can add more later.
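Writing the taxonomy down as code keeps the names honest: one shared constant plus a tiny validator catches naming drift before it pollutes your slices. This is a minimal plain-Python sketch; the dimension values mirror the list above, and the helper name is mine.

```python
# One shared taxonomy so every eval script agrees on tag names and values.
TAXONOMY = {
    "topic": {"quote", "renewal", "bundle"},
    "difficulty": {"easy", "medium", "hard"},
    "failure-mode": {"missing-clause", "wrong-sku", "over-budget"},
}

def validate_tags(tags: dict[str, str]) -> None:
    """Raise early if a tag key or value drifts from the agreed taxonomy."""
    for key, value in tags.items():
        if key not in TAXONOMY:
            raise ValueError(f"unknown tag dimension: {key!r}")
        if value not in TAXONOMY[key]:
            raise ValueError(f"unknown value {value!r} for dimension {key!r}")

validate_tags({"topic": "bundle", "difficulty": "hard"})  # passes
```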
Where to attach tags
In LangSmith, tags can live on:
- dataset examples
- runs
- eval runs
In the offer-bundling demo, I tag examples by region and intent, and I tag runs by prompt version.
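Here is a rough sketch of what that looks like with the langsmith Python SDK. The dataset name, query, and prompt tag are placeholders, and example-level tags ride along as metadata (examples carry metadata rather than free-form tags); check the current docs for the exact signatures.

```python
from langsmith import Client, traceable

client = Client()

# Example-level tags go into metadata so they survive into eval results.
dataset = client.create_dataset(dataset_name="offer-bundling-golden")
client.create_example(
    inputs={"query": "Bundle a renewal quote for an EU enterprise account"},
    outputs={"expected_skus": ["SKU-ENT-EU"]},
    dataset_id=dataset.id,
    metadata={"region": "eu", "intent": "renewal", "difficulty": "hard"},
)

# Run-level tags record the prompt version so slices compare like with like.
@traceable(name="bundle_recommender", tags=["prompt-v2"])
def recommend_bundle(query: str) -> dict:
    # Real model call goes here; a stub keeps the sketch runnable.
    return {"skus": ["SKU-ENT-EU"], "clauses": ["gdpr-dpa"]}
```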
Using tags in practice
Once tags exist, you can:
- filter dashboards by tag
- compare prompt versions on the same slice
- isolate regressions hidden by averages
Example: the eu + missing-clause slice caught a GDPR failure that was invisible in the global score.
Left unnoticed, these slice-level failures would have let incorrect or non-compliant recommendations ship.
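The same comparison can be scripted against the runs API. A sketch, with assumptions: the filter string uses LangSmith's has(tags, ...) query syntax as I understand it, the project name is a placeholder, it assumes each run is also tagged with the request's region at trace time, and "failure" is simplified to "the run errored" where a real check would slice on feedback scores.

```python
from langsmith import Client

client = Client()

def error_rate(prompt_tag: str, slice_tag: str) -> float:
    """Share of runs in one tagged slice that errored, for one prompt version."""
    runs = list(client.list_runs(
        project_name="offer-bundling",
        filter=f'and(has(tags, "{prompt_tag}"), has(tags, "{slice_tag}"))',
    ))
    if not runs:
        return 0.0
    failures = sum(1 for run in runs if run.error is not None)
    return failures / len(runs)

# Compare prompt versions on the same slice instead of on the global average.
for prompt in ("prompt-v1", "prompt-v2"):
    print(prompt, "error rate on eu slice:", error_rate(prompt, "eu"))
```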
Building tagged sets from traces
When you harvest traces for a golden dataset:
- auto-tag by metadata (region, intent)
- manually review the high-risk cases
This keeps the golden set small and useful.
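A rough harvesting loop might look like the following. The metadata keys (region, intent) and the review rule are assumptions from the demo, and where metadata lives on the Run object can vary by SDK version, so treat this as a sketch rather than copy-paste.

```python
from langsmith import Client

client = Client()
dataset = client.read_dataset(dataset_name="offer-bundling-golden")

for run in client.list_runs(project_name="offer-bundling", limit=200):
    meta = (run.extra or {}).get("metadata", {})
    tags = {
        "region": meta.get("region", "unknown"),
        "intent": meta.get("intent", "unknown"),
    }
    # Auto-tag everything; flag the high-risk slice for manual review.
    tags["needs-review"] = tags["region"] == "eu"
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=dataset.id,
        metadata=tags,
    )
```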
Maintain a core tagged set
I keep 30 to 50 examples tagged by the highest-risk categories and run them on every prompt change. This is the fastest regression check I have found.
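Sketched below: the evaluate() entry point and the (run, example) evaluator convention follow the langsmith SDK as I understand it, and the EU filter plus the exact-match clause check are illustrative stand-ins for your own highest-risk tags and metrics.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Pull only the high-risk slice; filtering in plain Python keeps it explicit.
core_set = [
    ex for ex in client.list_examples(dataset_name="offer-bundling-golden")
    if (ex.metadata or {}).get("region") == "eu"
]

def target(inputs: dict) -> dict:
    # Stand-in for the traced recommend_bundle() from the earlier sketch.
    return {"skus": ["SKU-ENT-EU"], "clauses": ["gdpr-dpa"]}

def clause_match(run, example) -> dict:
    # Scores 1 when the predicted clauses match the reference exactly.
    predicted = (run.outputs or {}).get("clauses")
    expected = (example.outputs or {}).get("clauses")
    return {"key": "missing-clause", "score": int(predicted == expected)}

evaluate(
    target,
    data=core_set,
    evaluators=[clause_match],
    experiment_prefix="core-regression-prompt-v2",
)
```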
Closing thought
Tags are a small habit that creates big clarity. If you pick just three tags and apply them consistently, you will find regressions that averages never show.