You Cannot Ship What You Cannot Measure: Why Agent Evaluation Is the Foundation of Production MLOps

The agent that was getting worse and nobody knew

A team had shipped an LLM-based customer support agent six months earlier. Usage was climbing. Stakeholders liked the engagement metrics. The team was busy adding features.

Three months in, a customer support manager noticed the agent's answers had gotten noticeably less helpful. Not dramatically wrong. Just subtly less precise, occasionally missing context it used to catch.

Asked how they would detect that kind of slow quality slide, the team went quiet. They had A/B testing, uptime monitoring, and latency alerting.

They had no evaluation pipeline.

The degradation had built up over three months of prompt tweaks, model updates, and retrieval config changes, none of it ever checked against a consistent quality benchmark. This is not an unusual story. It is the default for teams that treat evaluation as a launch-day task instead of an ongoing operational discipline.

Why agent evaluation is different from traditional software testing

Non-determinism. The same input can produce different outputs from one call to the next. Evaluation has to measure distributions, not point values.

Silent quality degradation. An LLM agent can get worse without ever throwing an error. A degraded agent still returns 200 OK. The damage shows up only in output quality, and judging that takes a human or an AI assist.

Multi-dimensional quality. A good agent response is correct, relevant, faithful to its source, safe, appropriately confident, and inside its latency budget. Those dimensions move independently. A traditional pass/fail test cannot capture any of that.

The five evaluation dimensions

+-------------------+------------------------------------------+
| DIMENSION         | WHAT IT MEASURES                         |
+-------------------+------------------------------------------+
| Correctness       | Is the answer factually accurate?        |
|                   | (Requires ground truth answers)          |
+-------------------+------------------------------------------+
| Relevance         | Does the answer address the question?    |
|                   | (Can diverge from correctness)           |
+-------------------+------------------------------------------+
| Faithfulness      | Is the answer grounded in the source     |
|                   | context? (Anti-hallucination metric)     |
+-------------------+------------------------------------------+
| Safety            | Does the answer avoid harmful,           |
|                   | biased, or off-policy content?           |
+-------------------+------------------------------------------+
| Latency           | Is response time within SLA?             |
|                   | (P50, P90, P99 percentiles)              |
+-------------------+------------------------------------------+

Step one is deciding which dimensions matter for your use case and weighting them accordingly. A financial document extraction system lives or dies on correctness and faithfulness. A customer support agent cares most about relevance and safety.

The MLOps evaluation pipeline

graph TD
    subgraph DEV["Development Phase"]
        CODE[Code / Prompt Change] --> UNIT_EVAL[Unit Evaluation - sample subset]
        UNIT_EVAL --> PR_GATE{PR Quality Gate}
        PR_GATE -->|Passes| STAGING[Staging Deployment]
        PR_GATE -->|Fails| BLOCK[Blocked - eval report]
    end

    subgraph STAGING_EVAL["Staging Evaluation"]
        STAGING --> FULL_EVAL[Full Evaluation - complete benchmark]
        FULL_EVAL --> COMPARE[Comparison vs. Production Baseline]
        COMPARE --> DEPLOY_GATE{Deployment Gate}
        DEPLOY_GATE -->|Passes| PROD[Production Deployment]
        DEPLOY_GATE -->|Fails| ROLLBACK[Block + Regression Report]
    end

    subgraph PROD_MONITORING["Production Monitoring"]
        PROD --> SAMPLE[5% Random Sample - flagged for eval]
        PROD --> TRIGGER[User Feedback Triggers]
        SAMPLE --> ASYNC_EVAL[Async Evaluation - LLM-as-Judge]
        TRIGGER --> ASYNC_EVAL
        ASYNC_EVAL --> DAILY_REPORT[Daily Quality Report]
        DAILY_REPORT --> ALERT3{Quality Alert}
        ALERT3 -->|Degradation| ENG_ALERT[Engineering Alert]
        ALERT3 -->|Within bounds| TREND[Trend Dashboard]
    end

    subgraph DATASET["Evaluation Dataset Management"]
        ASYNC_EVAL --> CANDIDATE[Candidate Addition - interesting cases]
        CANDIDATE --> ANNOTATE[Human Annotation Queue]
        ANNOTATE --> GOLDEN[Golden Dataset - updated monthly]
        GOLDEN --> FULL_EVAL
        GOLDEN --> UNIT_EVAL
    end

Building the evaluation dataset

The evaluation dataset is the foundation of the whole framework. It is also the most time-consuming part to build, and the part teams skip most often.

For a new system, start with 50 to 100 hand-curated query-response pairs: representative common cases (60%), edge cases (25%), and adversarial cases built to stress-test safety and correctness (15%).

For an existing system with production history, mine the logs. Look for high-confidence responses that users later corrected, queries that triggered escalation, and queries that sit on the edge of what the system can do.

LLM-as-Judge: scaling evaluation without scaling headcount

LLM-as-Judge uses a separate, more capable LLM to score the production LLM's outputs against your criteria. It is what lets evaluation keep up with production volume.

The implementation needs care:

The judge has to be told, explicitly, to evaluate against specific and well-defined criteria
The judge has to be calibrated against human annotations
The judge's verdicts are signals, not ground truth. Middle scores still need a human to look

Calibrate it properly and LLM-as-Judge hits 85 to 90% agreement with human annotation on binary quality judgments. That is enough for production monitoring, where the job is catching a degradation trend, not grading every answer perfectly.

What good evaluation enables

With evaluation in place, prompt changes and model updates can be tested against the benchmark in minutes. The deployment gate lets the team ship a change without sweating its unknown quality impact. And the evaluation data tells you which approach actually moved quality, instead of which one felt better.

Without evaluation, every change is a guess, every deployment is a gamble, and your users find the quality regressions before your engineers do.