
Why Averages Lie: Using Tags to Expose Hidden LLM Regressions

Averages hide real problems. I learned that the hard way: a bundle recommender looked stable overall, but a single region-specific slice was failing badly. Only after tracing individual failures to root cause did I see the real issue: our aggregate metrics were hiding where users were actually failing. Tags turned that hidden failure into a visible signal.

This post shows the tagging approach I use in the offer-bundling project: small taxonomy, consistent naming, and slices that map to real user risk. This pattern shows up in any evaluation system where aggregate scores hide localized failures.

In 30 seconds

  • Tag examples and runs so you can slice evals by topic and risk.
  • Keep the taxonomy small: 3 to 5 tags are enough to start.
  • Compare model or prompt variants on the same tagged slices.

Key takeaways

  • Tags turn flat averages into slices you can act on.
  • Consistent naming saves hours later.
  • Maintain a core tagged set for fast regression checks.

Why tags matter

If you only look at overall scores, you miss where users actually struggle. Tags let you answer:

  • Does the EU region fail more often?
  • Are multi-step queries worse than single-step?
  • Do compliance clauses drop under certain prompts?

A simple tagging taxonomy

Start with three dimensions:

  • topic: quote, renewal, bundle
  • difficulty: easy, medium, hard
  • failure-mode: missing-clause, wrong-sku, over-budget

Keep it small. You can add more later.
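
I find the taxonomy stays consistent longer if it lives in code instead of in people's heads. Here is a minimal sketch in plain Python; the constant and function names are mine, not part of any SDK:

```python
# A small, explicit tag taxonomy. Keeping allowed values in one place
# makes it easy to reject typos before they reach the eval dataset.
ALLOWED_TAGS = {
    "topic": {"quote", "renewal", "bundle"},
    "difficulty": {"easy", "medium", "hard"},
    "failure-mode": {"missing-clause", "wrong-sku", "over-budget"},
}

def validate_tags(tags: dict[str, str]) -> dict[str, str]:
    """Reject tags outside the agreed taxonomy so naming stays consistent."""
    for dimension, value in tags.items():
        if dimension not in ALLOWED_TAGS:
            raise ValueError(f"Unknown tag dimension: {dimension}")
        if value not in ALLOWED_TAGS[dimension]:
            raise ValueError(f"Unknown value {value!r} for dimension {dimension}")
    return tags

# Usage: fails loudly on "renewel" or "hardd" instead of silently splitting a slice.
example_tags = validate_tags({"topic": "bundle", "difficulty": "hard"})
```

A validation step like this is cheap, and it is the difference between one clean "renewal" slice and three fragmented ones.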

Where to attach tags

In LangSmith, tags can live on:

  • dataset examples
  • runs
  • eval runs

In the offer-bundling demo, I tag examples by region and intent, and I tag runs by prompt version.
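
Roughly, that looks like this with the LangSmith Python SDK. Treat it as a sketch: the dataset name, example content, tag values, and function name are illustrative, and exact keyword arguments may differ across SDK versions.

```python
from langsmith import Client, traceable

client = Client()

# Tag dataset examples via metadata at creation time.
dataset = client.create_dataset("offer-bundling-golden")
client.create_example(
    dataset_id=dataset.id,
    inputs={"query": "Renew the EU customer's bundle within budget"},
    outputs={"expected_sku": "BUNDLE-EU-12M"},
    metadata={"region": "eu", "intent": "renewal", "difficulty": "hard"},
)

# Tag runs by prompt version so later comparisons can slice on it.
@traceable(name="offer-bundler", tags=["prompt-v2"])
def recommend_bundle(query: str) -> str:
    # ... call the model and assemble a bundle recommendation ...
    return "BUNDLE-EU-12M"
```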

Using tags in practice

Once tags exist, you can:

  • filter dashboards by tag
  • compare prompt versions on the same slice
  • isolate regressions hidden by averages

Example: the eu + missing-clause slice caught a GDPR failure that was invisible in the global score. Left unnoticed, these slice-level failures would have shipped incorrect or non-compliant recommendations.
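
Mechanically, "isolate regressions hidden by averages" is just a group-by over tags. A minimal sketch, assuming eval results are loaded as dicts with the example's tags and a pass/fail flag (the data and field names are mine):

```python
from collections import defaultdict

# Made-up results to show the effect: each result carries its example's
# tags plus a pass/fail verdict from the evaluator.
results = [
    {"tags": {"region": "eu", "failure-mode": "missing-clause"}, "passed": False},
    {"tags": {"region": "eu", "failure-mode": "missing-clause"}, "passed": False},
    {"tags": {"region": "us", "failure-mode": "wrong-sku"}, "passed": True},
    {"tags": {"region": "us", "failure-mode": "wrong-sku"}, "passed": True},
    {"tags": {"region": "apac"}, "passed": True},
]

def pass_rate_by(results, dimension):
    """Pass rate per tag value: the per-slice view that a single average hides."""
    buckets = defaultdict(list)
    for r in results:
        value = r["tags"].get(dimension, "untagged")
        buckets[value].append(r["passed"])
    return {value: sum(flags) / len(flags) for value, flags in buckets.items()}

print(pass_rate_by(results, "region"))
# Overall pass rate is 60%, but the eu slice sits at 0% -- the hidden regression.
```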

Building tagged sets from traces

When you harvest traces for a golden dataset:

  1. auto-tag by metadata (region, intent)
  2. manually review the high-risk cases

This keeps the golden set small and useful.
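
Step 1 is usually a few lines per dimension. A sketch of auto-tagging, assuming traces carry a region field in their metadata and a free-text query in their inputs (the field names and keyword rules are mine):

```python
def auto_tags(trace: dict) -> dict:
    """Derive cheap tags from trace metadata; leave the high-risk review to humans."""
    tags = {}
    # Region usually comes straight from request metadata.
    region = trace.get("metadata", {}).get("region")
    if region:
        tags["region"] = region.lower()
    # Intent from a crude keyword match; good enough for a first pass.
    query = trace.get("inputs", {}).get("query", "").lower()
    if "renew" in query:
        tags["intent"] = "renewal"
    elif "bundle" in query:
        tags["intent"] = "bundle"
    return tags
```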

Maintain a core tagged set

I keep 30 to 50 examples tagged by the highest-risk categories and run them on every prompt change. This is the fastest regression check I have found.
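
Wired into CI, the check can be as small as: run the core set against the new prompt, compute per-slice pass rates, and fail if any high-risk slice drops. A sketch, assuming the same result shape as the slicing example above; the tag list and threshold are illustrative:

```python
from collections import defaultdict

HIGH_RISK_TAGS = {"missing-clause", "over-budget"}
MIN_SLICE_PASS_RATE = 0.9  # illustrative threshold, tune per project

def check_core_set(results):
    """Fail when any high-risk slice regresses, even if the overall average looks fine."""
    buckets = defaultdict(list)
    for r in results:
        mode = r["tags"].get("failure-mode")
        if mode in HIGH_RISK_TAGS:
            buckets[mode].append(r["passed"])
    failing = {
        mode: sum(flags) / len(flags)
        for mode, flags in buckets.items()
        if sum(flags) / len(flags) < MIN_SLICE_PASS_RATE
    }
    if failing:
        raise AssertionError(f"High-risk slices regressed: {failing}")
```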

Closing thought

Tags are a small habit that creates big clarity. If you pick just three tags and apply them consistently, you will find regressions that averages never show.



Written by Florin · Tech snippets for everybody