
Schema-Driven Validation for Stable LLM Evaluations

My earliest eval runs looked fine until they didn't: a single malformed row could flip a whole report. That was on me. The fix was simple: put schemas in front of the dataset and model outputs so bad examples never enter the pipeline.

A schema turns a noisy eval loop into a stable one. It blocks malformed rows at ingest, prevents shape drift as prompts evolve, and gives you consistent fields to compare over time. If you run evals in LangSmith, schema validation is the fastest way to make your signals trustworthy.

Here is the workflow I actually use. It is boring, but it works: define the minimum schema, enforce it at ingest, validate outputs in your chain, and quarantine failures.

In 30 seconds

  • Schemas stop flaky evals by blocking malformed examples and output drift.
  • Enforce a minimal dataset schema, then validate outputs in your chain/agent.
  • Track fewer invalid rows, higher rerun success, and lower variance.

Key takeaways

  • Schema-first datasets prevent regressions caused by bad data.
  • Output validation closes the loop and keeps evals honest.
  • Quarantine failures instead of polluting the golden set.

Why schema-driven validation matters

Prompt tweaks and model swaps are unavoidable. The hidden issue is shape drift: fields disappear, types change, and defaults leak in. A single malformed row can blow up an eval run or, worse, skew your metrics. Schemas protect you from both.

A schema gives you:

  • Consistent inputs and expected outputs.
  • Clear failures when data is invalid.
  • Stable metrics you can compare week to week.

Schemas are to evals what TypeScript is to JavaScript: they prevent silent breakage.

The win is not perfection. It is fewer false alarms and more confidence in the regressions you do see.

What to include in a schema

Keep it minimal at first. Start with the fields you need to score and compare.

Minimum schema:

  • input (string)
  • expected_output (object or JSON)
  • metadata (object)

Optional but useful:

  • Rubrics or grading criteria
  • Reference outputs for strict checks
  • Failure tags for slicing (hallucination, retrieval miss, format error)

The mistake I see most often: overengineering the schema on day one. Start small and add fields only when you need them.
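As a concrete starting point, here is that minimum schema sketched as a Pydantic model. The class name EvalExample is my own; only the three fields come from the list above, and expected_output stays a loose dict until you decide to tighten it.

# A minimal sketch of the dataset schema, assuming Pydantic v2
from pydantic import BaseModel, Field

class EvalExample(BaseModel):
    input: str
    expected_output: dict  # object / JSON; promote to its own model when needed
    metadata: dict = Field(default_factory=dict)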

Example: quote assistant with compliance rules

I worked on a quoting assistant for enterprise software. The agent had to output a structured quote summary: SKU, term length, region, compliance clauses, and a price breakdown. We ran evals on 200 examples every week.

The problem: small prompt changes would introduce missing fields or wrong types. A compliance_clauses array would come back as a string. A term_months integer would appear as "24 months". Evals failed for the wrong reason.

Once we added schemas at ingest and output validation in the chain, invalid rows dropped dramatically (from the high teens to low single digits). The only failures left were real regressions.

Step 1: Define the dataset schema

In LangSmith, add the schema to your dataset so invalid examples are rejected at ingest. You can do this in the UI or the SDK. This is intentionally non-runnable and focuses on the flow.

# Pseudo-code: define schema and enforce at dataset creation
schema = {
    "input": "string",
    "expected_output": {
        "sku": "string",
        "term_months": "integer",
        "region": "string",
        "compliance_clauses": "array"
    },
    "metadata": {
        "intent": "string",
        "catalog_version": "string"
    }
}

dataset = langsmith.create_dataset(
    name="golden-quotes-v1",
    schema=schema
)

If a row does not match the schema, it never enters your golden set.
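If you want a formal version of the same shape, standard JSON Schema works and is understood by most validators, including the jsonschema package used in the next step. This is my own rendering of the pseudo-schema above, not LangSmith-specific syntax; the variable name quote_example_schema is an assumption.

# The same schema written as standard JSON Schema
quote_example_schema = {
    "type": "object",
    "required": ["input", "expected_output", "metadata"],
    "properties": {
        "input": {"type": "string"},
        "expected_output": {
            "type": "object",
            "required": ["sku", "term_months", "region", "compliance_clauses"],
            "properties": {
                "sku": {"type": "string"},
                "term_months": {"type": "integer"},
                "region": {"type": "string"},
                "compliance_clauses": {"type": "array", "items": {"type": "string"}},
            },
        },
        "metadata": {
            "type": "object",
            "properties": {
                "intent": {"type": "string"},
                "catalog_version": {"type": "string"},
            },
        },
    },
}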

Step 2: Validate inputs before import

If you bulk import from traces or CSVs, validate before you push. Do not rely on human cleanup after the fact.

# Pseudo-code: reject malformed rows before upload
for row in rows:
    try:
        validate_schema(row, schema)
        dataset.add_example(**row)
    except ValidationError as e:
        quarantine.add_example(row, error=str(e))

This lets you keep the golden set clean while still keeping the bad rows for analysis.
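To make that loop concrete, here is a minimal sketch using the jsonschema package. dataset.add_example and the quarantined list are stand-ins for whatever upload and quarantine mechanism you actually use, and quote_example_schema is the JSON Schema from Step 1.

# Reject malformed rows before upload, using the jsonschema package
from jsonschema import validate, ValidationError

quarantined = []
for row in rows:
    try:
        validate(instance=row, schema=quote_example_schema)  # JSON Schema from Step 1
        dataset.add_example(**row)  # stand-in for your real upload call
    except ValidationError as e:
        # Keep the bad row and its error message for later analysis
        quarantined.append({"row": row, "error": e.message})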

Step 3: Validate model outputs in the chain

Even with a clean dataset, outputs can drift. Add an output schema in your chain so you catch invalid responses early.

# Output validation with Pydantic (v2)
from pydantic import BaseModel

class QuoteOutput(BaseModel):
    sku: str
    term_months: int
    region: str
    compliance_clauses: list[str]

raw = chain.invoke(inputs)
validated = QuoteOutput.model_validate(raw)

If validation fails, log it and move on. Do not let a bad output pollute your eval results.
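One way to wire that in, assuming Pydantic v2 and the standard logging module; run_one is a hypothetical wrapper around your chain call, not part of any library.

# Log invalid outputs and skip them instead of scoring garbage
import logging
from pydantic import ValidationError

logger = logging.getLogger("evals")

def run_one(inputs: dict) -> QuoteOutput | None:
    raw = chain.invoke(inputs)
    try:
        return QuoteOutput.model_validate(raw)
    except ValidationError as e:
        # Count this as an invalid output, not a model regression, and keep going
        logger.warning("output failed schema validation: %s", e)
        return None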

Step 4: Quarantine failures

Create a "quarantine" dataset for rows or outputs that fail validation. I treat it like a backlog:

  • Inspect failures weekly.
  • Fix bad examples or adjust the schema deliberately.
  • Promote only corrected rows into the golden set.

This keeps your regression set stable and your debugging focused.
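A rough sketch of the promotion step, reusing the jsonschema check from Step 2; quarantine_rows, fix_row, and golden_dataset are placeholders for however you store and correct these.

# Promote only rows that pass validation after being fixed
from jsonschema import validate, ValidationError

for item in quarantine_rows:
    fixed = fix_row(item["row"])  # manual or scripted correction (placeholder)
    try:
        validate(instance=fixed, schema=quote_example_schema)
        golden_dataset.add_example(**fixed)  # promote into the golden set
    except ValidationError as e:
        item["error"] = e.message  # keep it in the backlog with a fresh error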

Step 5: Measure stability

If schema validation is working, you should see:

  • Fewer invalid rows at ingest.
  • Higher eval rerun success rates.
  • Lower run-to-run variance.

In my case, the metrics were obvious. We went from weekly reruns that failed half the time to a stable eval loop that only failed when the model actually regressed.
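If you want hard numbers to back that up, a small summary like this is enough, assuming you log per-run counts somewhere; the run dict keys are my own naming.

# Summarize schema health across eval runs; each run dict is whatever you log per week
from statistics import pstdev

def summarize_runs(runs: list[dict]) -> None:
    for run in runs:
        invalid_rate = run["invalid_rows"] / run["total_rows"]
        print(f'{run["week"]}: invalid rate {invalid_rate:.1%}, score {run["score"]:.2f}')
    # Run-to-run spread of the headline score; lower means a more stable eval loop
    print("score stddev:", pstdev(run["score"] for run in runs))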

A simple diagram to include

Ingest traces → schema validation → dataset → eval run → output validation → results + quarantine

Closing thought

Schema-driven validation is not glamorous, but it is the easiest way to stop chasing false regressions. If you already use LangSmith, add a minimal schema to your main dataset, enforce it at ingest, and validate outputs in your chain. Your evals will finally tell the truth. In my case, once schemas were in place, debugging stopped feeling chaotic.

If you want a starter schema template or validation helpers, I can share a lightweight snippet or repo.



Written by Florin. Tech snippets for everybody.