
Why Averages Lie: Using Tags to Expose Hidden LLM Regressions

Averages hide real problems. I learned that the hard way: a bundle recommender looked stable overall, but a single region-specific slice was failing badly. Tags turned that hidden failure into a visible signal, and tracing individual failures to root cause made the bigger lesson clear: our aggregate metrics were hiding where users were actually failing.

This post shows the tagging approach I use in the offer-bundling project: small taxonomy, consistent naming, and slices that map to real user risk. This pattern shows up in any evaluation system where aggregate scores hide localized failures.

In 30 seconds

  • Tag examples and runs so you can slice evals by topic and risk.
  • Keep the taxonomy small: 3 to 5 tags are enough to start.
  • Compare model or prompt variants on the same tagged slices.

Key takeaways

  • Tags turn averages into actionable insights.
  • Consistent naming saves hours later.
  • Maintain a core tagged set for fast regression checks.

Why tags matter

If you only look at overall scores, you miss where users actually struggle. Tags let you answer:

  • Does the EU region fail more often?
  • Are multi-step queries worse than single-step?
  • Do compliance clauses drop under certain prompts?
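To make that concrete, here is a minimal sketch in plain Python of the kind of per-tag breakdown that answers those questions. The record shape ({"tags": [...], "passed": ...}) is a hypothetical stand-in for whatever your eval harness emits:

```python
from collections import defaultdict

def failure_rate_by_tag(results):
    """Group eval results by tag and report the failure rate per slice."""
    totals, failures = defaultdict(int), defaultdict(int)
    for r in results:
        for tag in r["tags"]:
            totals[tag] += 1
            if not r["passed"]:
                failures[tag] += 1
    return {tag: failures[tag] / totals[tag] for tag in totals}

# The overall average looks acceptable, but the "eu" slice clearly is not.
results = [
    {"tags": ["us", "single-step"], "passed": True},
    {"tags": ["us", "multi-step"], "passed": True},
    {"tags": ["eu", "multi-step"], "passed": False},
    {"tags": ["eu", "single-step"], "passed": False},
]
print(failure_rate_by_tag(results))  # e.g. {'us': 0.0, ..., 'eu': 1.0, ...}
```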

A simple tagging taxonomy

Start with three dimensions:

  • topic: quote, renewal, bundle
  • difficulty: easy, medium, hard
  • failure-mode: missing-clause, wrong-sku, over-budget

Keep it small. You can add more later.
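One way to keep the naming consistent is to declare the taxonomy once and validate against it. This is a plain-Python sketch; the tag values mirror the lists above, and the helper itself is just an illustration:

```python
# The three starting dimensions, declared once so naming stays consistent.
TAXONOMY = {
    "topic": {"quote", "renewal", "bundle"},
    "difficulty": {"easy", "medium", "hard"},
    "failure-mode": {"missing-clause", "wrong-sku", "over-budget"},
}

def validate_tags(tags: dict[str, str]) -> None:
    """Reject tags outside the agreed taxonomy (catches 'Bundle' vs 'bundle')."""
    for dimension, value in tags.items():
        allowed = TAXONOMY.get(dimension)
        if allowed is None:
            raise ValueError(f"Unknown tag dimension: {dimension!r}")
        if value not in allowed:
            raise ValueError(f"{value!r} is not a valid {dimension} tag")

validate_tags({"topic": "bundle", "difficulty": "hard"})   # ok
# validate_tags({"topic": "Bundle"})  # raises: inconsistent casing caught early
```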

Where to attach tags

In LangSmith, tags can live on:

  • dataset examples
  • runs
  • eval runs

In the offer-bundling demo, I tag examples by region and intent, and I tag runs by prompt version.
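Here is roughly how that looks with the LangSmith Python SDK. Treat the parameter names as my reading of the client at the time of writing rather than gospel, and check the current docs; the dataset name, example content, and prompt-version tag are all placeholders:

```python
from langsmith import Client, traceable

client = Client()  # assumes LANGSMITH_API_KEY is set in the environment

# 1. Tag dataset examples via metadata when the golden set is created.
dataset = client.create_dataset("offer-bundling-golden")
client.create_example(
    inputs={"query": "Bundle a renewal quote for an EU customer"},
    outputs={"expected_skus": ["SKU-123"], "clauses": ["gdpr"]},
    dataset_id=dataset.id,
    metadata={"region": "eu", "intent": "bundle", "difficulty": "hard"},
)

# 2. Tag runs by prompt version so traces can be sliced later.
@traceable(name="bundle-recommender", tags=["prompt-v3"])
def recommend_bundle(query: str) -> dict:
    ...  # call the model / chain here

# 3. Eval runs can carry experiment-level metadata too, e.g. via evaluate()
#    with metadata={"prompt_version": "v3"} -- see the LangSmith eval docs.
```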

Using tags in practice

Once tags exist, you can:

  • filter dashboards by tag
  • compare prompt versions on the same slice
  • isolate regressions hidden by averages

Example: the eu + missing-clause slice caught a GDPR failure that was invisible in the global score. Left unchecked, slice-level failures like that one would have shipped incorrect or non-compliant recommendations.
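The comparison step itself is small. A sketch in plain Python, assuming each eval result carries the example's tags and a pass/fail verdict (hypothetical field names and data):

```python
def slice_pass_rate(results, required_tags):
    """Pass rate restricted to examples carrying all of `required_tags`."""
    hits = [r for r in results if required_tags <= set(r["tags"])]
    return sum(r["passed"] for r in hits) / len(hits) if hits else None

# Two prompt versions over the same tagged dataset: the regression only
# shows up on the slice, not in the overall average.
results_v2 = [
    {"tags": ["eu", "missing-clause"], "passed": True},
    {"tags": ["us", "wrong-sku"], "passed": True},
]
results_v3 = [
    {"tags": ["eu", "missing-clause"], "passed": False},  # regression lives here
    {"tags": ["us", "wrong-sku"], "passed": True},
]

slice_tags = {"eu", "missing-clause"}
print("prompt v2:", slice_pass_rate(results_v2, slice_tags))  # 1.0
print("prompt v3:", slice_pass_rate(results_v3, slice_tags))  # 0.0
```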

Building tagged sets from traces

When you harvest traces for a golden dataset:

  1. auto-tag by metadata (region, intent)
  2. manually review the high-risk cases

This keeps the golden set small and useful.
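A rough sketch of that two-step flow, assuming traces arrive as dicts with a metadata field; the field names and the "high-risk" rule are mine, not a fixed convention:

```python
HIGH_RISK_REGIONS = {"eu"}  # compliance-sensitive, so always reviewed by hand

def harvest(traces):
    """Turn raw traces into candidate golden examples with auto-applied tags."""
    candidates = []
    for trace in traces:
        meta = trace.get("metadata", {})
        region = meta.get("region", "unknown")
        intent = meta.get("intent", "unknown")
        candidates.append({
            "inputs": trace["inputs"],
            "outputs": trace["outputs"],
            "tags": [region, intent],
            "needs_review": region in HIGH_RISK_REGIONS,
        })
    return candidates

# Step 2 is manual: only items with needs_review=True get a human pass.
```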

Maintain a core tagged set

I keep 30 to 50 examples tagged by the highest-risk categories and run them on every prompt change. This is the fastest regression check I have found.
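As a script, the check is little more than a threshold per high-risk slice. This sketch assumes a run_eval() helper that returns tagged pass/fail results; the slices and thresholds are placeholders for your own risk profile:

```python
CORE_SLICES = {
    "eu": 1.0,              # compliance slice must stay perfect
    "missing-clause": 0.9,  # minimum acceptable pass rate
    "over-budget": 0.9,
}

def check_core_set(results):
    """Return a list of slices that dropped below their threshold."""
    problems = []
    for tag, threshold in CORE_SLICES.items():
        hits = [r for r in results if tag in r["tags"]]
        if not hits:
            continue
        rate = sum(r["passed"] for r in hits) / len(hits)
        if rate < threshold:
            problems.append(f"{tag}: {rate:.2f} < {threshold:.2f}")
    return problems

# results = run_eval(core_examples, prompt="v4")   # placeholder eval step
# problems = check_core_set(results)
# if problems:
#     raise SystemExit("Regression on core slices: " + ", ".join(problems))
```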

Closing thought

Tags are a small habit that creates big clarity. If you pick just three tags and apply them consistently, you will find regressions that averages never show.



Written by Florin — full-stack & AI engineer.