Why You Can’t Answer “How Reliable Is This Agent?”
This seemingly basic question prevents wider roll-out and adoption. Here’s how teams are solving for it.
The never-ending pilot
We recently spoke with a retail organization that pushed a customer-facing AI agent into production — quite the accomplishment!
It’s a customer support agent that can be accessed on their website… well, at least in one small corner of their website. The problem is the organization doesn’t have the conviction or trust to make it more widely accessible.
And after reading headlines about cars being sold for $1 and 18,000 waters being ordered at the drive-thru, I don’t blame them.
I got the sense the development team also didn’t have full conviction in their agent, or at least couldn’t demonstrate reliable performance in production.
And so the agent continues to languish in its tiny corner of website purgatory. Dreams and aspirations of ROI dashed by corporate inertia and a lack of trust.

This rags-to-rags story is all too common when we speak with data + AI teams about their agent initiatives. Let’s dive into why this happens and some strategies for crossing the AI trust chasm.
It’s harder to evaluate agents in prod than in dev

I’m not saying evaluating agent reliability in development is easy, but I’m not not saying it either.
In development you can create thoughtful, representative scenarios and golden datasets: common inputs with their corresponding expected outputs. It’s almost like cheating on a test; you have all the answers.
You run the agent (or at least the specific workflow you are testing) and manually grade the outcome using your human judgement. You can also measure each component of the process using all types of algorithms from F1 for retrieval to cosine similarity for semantic distance.
But it’s harder to measure recall at scale in a programmatic way in production, where inputs can’t be fully anticipated and outputs are non-deterministic.
Without robust observability (traces and evaluation frameworks), it’s almost impossible to know when or why things go wrong.
You’re no longer testing whether the agent can do the job, but how often it does an acceptable job in a variety of real conditions. There is a reason an MIT study found 95% of AI initiatives fail within 12 months.
The north star metric for AI reliability is downtime
Ultimately, the north star metric for measuring the reliability of a system is uptime (or its inverse, downtime).
This concept for software applications was popularized by the Google Site Reliability Engineering Handbook, which defined downtime as the portion of unsuccessful requests divided by the total number of requests.
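That definition translates directly into a one-line calculation. A minimal sketch, with illustrative request counts:

```python
# Sketch: downtime as defined above, i.e. unsuccessful requests divided
# by total requests. The figures passed in below are made up for illustration.

def downtime_ratio(failed_requests: int, total_requests: int) -> float:
    """Fraction of requests that failed; uptime is its complement."""
    if total_requests == 0:
        return 0.0  # no traffic means no measurable downtime
    return failed_requests / total_requests

ratio = downtime_ratio(failed_requests=37, total_requests=10_000)
print(f"downtime: {ratio:.2%}, uptime: {1 - ratio:.2%}")
```

The same ratio works whether a “request” is an HTTP call, a data pipeline run, or an agent response; only the definition of “failed” changes.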

This concept has proven to be quite flexible across disciplines as data engineers have found success by defining SLAs for their data products and working to minimize data downtime.
“With these data SLAs in place, I created a dashboard by business and by warehouse to understand what percentage of SLAs were being met on a daily basis. As someone with a data background, I want to know what contracts we have made as a team and where we are falling down on the job.” -Brandon Bidel, Director of Data Science, Red Ventures
There are many reasons it is such an effective metric, but perhaps the most important is that it moves teams from a reactive to proactive approach.
Without defined SLAs or downtime goals, engineering teams will persist in a perpetual cycle of break-fix firedrills seemingly at the mercy of the fickle finger of fate. The tail wags the dog as the volume of the ticket submitter dictates prioritization rather than the impact to the business.
Seems easy enough, but there are some challenges that must be overcome.
How to define AI downtime
Like everything in the AI space, defining an SLA or even a successful request is more difficult than it seems. After all, these are non-deterministic systems, meaning you can provide the same input many times and get many different outputs.
Is a request only unsuccessful if it technically fails? What about if it hallucinates and provides inaccurate information? What if the information is technically correct, but it’s in another language or surrounded by toxic language?
The key is to not lose the forest for the trees. Ultimately, the goal of reducing downtime is to ensure features are adopted and provide the intended value to users.
This means agent downtime should be measured based on the underlying use case. For example, clarity and tone of voice might be paramount for a customer success chatbot, but it might not be a large factor for a revenue operations agent providing summarized insights from sales calls.
Dropbox, for example, measures agent downtime as:
- Responses without a citation
- If the 95th percentile of response latency exceeds 5 seconds
- If the agent does not reference the right source at least 85% of the time (an F1 score below 85%)
- Factual accuracy, clarity, and formatting are other dimensions, but failure thresholds aren’t provided.
At Monte Carlo, our development team considers our Troubleshooting Agent to be experiencing downtime based on the metrics of semantic distance, groundedness, and proper tool usage. These are evaluated on a 0–1 scale using an LLM-as-judge methodology. Downtime in staging is defined as:
- Any score under 0.5
- More than 33% of LLM-as-judge evaluations, or more than 2 total evaluations, score between 0.5 and 0.8 even after an automatic retry.
- Groundedness tests show the agent invents information or answers out of scope (hallucination or missing context).
- The agent misuses or fails to call required tools
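Rules like these are straightforward to encode as a single check over the judge scores. A hypothetical sketch: the function name, signature, and flags below are illustrative, not our actual implementation.

```python
# Illustrative encoding of staging downtime rules like the ones above.
# Scores are LLM-as-judge outputs on a 0-1 scale; names are hypothetical.

def is_downtime(scores: list[float], hallucinated: bool, tool_failure: bool) -> bool:
    # Groundedness failure or tool misuse is downtime outright.
    if hallucinated or tool_failure:
        return True
    # Any score under 0.5 is downtime.
    if any(s < 0.5 for s in scores):
        return True
    # Borderline scores (0.5-0.8) after retry: more than 33% of
    # evaluations, or more than 2 total, counts as downtime.
    if scores:
        borderline = [s for s in scores if 0.5 <= s <= 0.8]
        if len(borderline) > 2 or len(borderline) / len(scores) > 0.33:
            return True
    return False

print(is_downtime([0.9, 0.85, 0.95, 0.6], hallucinated=False, tool_failure=False))
```

In practice a check like this would run per evaluation window rather than per response, with the results feeding the downtime ratio.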
For a revenue operations technology platform, the team defines downtime relatively simply as a p90 time to first token exceeding 5 seconds. LinkedIn’s Hiring Assistant team also focused on throughput and latency, specifically metrics like queries per second (QPS) and p90 end-to-end latency. They mention not sacrificing output quality, but don’t share the associated metrics for how they measure it.
While evaluation criteria are often highly customized to the use case, there are some common dimensions. These include:
- Answer relevance: Did the agent provide a relevant response based on the user inquiry?
- Helpfulness: How useful and informative is the output?
- Clarity: How clear and understandable is the response given the input context?
- Task completion: Did the output successfully complete the requested task?
- Prompt adherence: Did the agent comply with the specific instructions?
- Language match: Did the output language match the input language?
For real evaluation templates you can lift and use today, check out our blog on LLM-as-judge evaluations.
Other common downtime metrics include operational metrics like token usage and latency, as well as more deterministic monitors for “hard failures,” or as Dropbox calls them, “boolean gates.” Common deterministic monitors include:
- Output length
- Format (address, JSON, etc)
- Banned words
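Because these gates are deterministic, they can run on every response before it reaches the user. A minimal sketch; the gate limits, banned-word list, and function name here are illustrative assumptions, not any vendor’s implementation:

```python
import json

# Hypothetical deterministic "boolean gate" monitors for the checks above:
# output length, format, and banned words. All names and limits are made up.

BANNED_WORDS = {"guarantee", "refund"}  # illustrative list

def passes_gates(output: str, max_chars: int = 2000, expect_json: bool = False) -> bool:
    if len(output) > max_chars:           # output length gate
        return False
    if expect_json:                       # format gate
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return False
    # Banned-words gate (simple whitespace split; real checks would normalize)
    if set(output.lower().split()) & BANNED_WORDS:
        return False
    return True

print(passes_gates('{"status": "ok"}', expect_json=True))
```

Failures on these gates can be counted directly toward the downtime ratio, since no judgment call is required.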
Adoption matters too, it’s just a lagging indicator
Today, most data + AI teams I talk to use adoption as the main proxy for agent reliability. One startup I spoke with monitors usage and latency by user and by account on a dashboard they compulsively check every day.
This works to an extent. Downtime should correspond to user adoption. If you have high adoption and high downtime, you haven’t captured the key metrics that make your agent valuable.
Just keep in mind that adoption is a lagging indicator, and not necessarily one that will help direct your reliability engineering efforts. Ideally, you want to catch and fix your agent’s reliability issues before your users walk out the door.
Calculating the ROI of agent reliability
Agent ROI is often quantifiable across the classic business value drivers of reducing cost, increasing revenue, and decreasing risk. In these scenarios, the cost of downtime can be quantified easily by taking the frequency and duration of downtime and multiplying it by the ROI being driven by the agent.
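That formula is just back-of-the-envelope arithmetic. A minimal sketch; every figure below is made up for illustration:

```python
# Sketch of the downtime cost formula described above:
# cost = downtime frequency x duration x value the agent drives per hour.
# All inputs are hypothetical.

def downtime_cost(incidents_per_month: float, hours_per_incident: float,
                  agent_value_per_hour: float) -> float:
    return incidents_per_month * hours_per_incident * agent_value_per_hour

monthly_cost = downtime_cost(incidents_per_month=4, hours_per_incident=1.5,
                             agent_value_per_hour=500.0)
print(monthly_cost)  # 3000.0
```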
This formula remains mostly academic at the moment since, as we’ve noted previously, most teams are less focused on immediate ROI and more focused on advancing their AI capabilities.
However, I have spoken to a few teams working to document the ROI of their agents. One of the clearest examples in this regard is a pharmaceutical company using an agent to enrich customer records in a master data management match merge process.
They originally built their business case on reducing cost, specifically the number of records that need to be enriched by human stewards. However, while they did increase the number of records that could be automatically enriched, they also improved a large number of poor records that would otherwise have been automatically discarded.
So the human steward workload actually increased! Ultimately, this was a good result as record quality improved, however it does underscore how fluid and unpredictable this space remains.
Building Conviction Through Reliability
Too many AI agents live out their days without reaching their full potential. Copilots stuck in never-ending pilots.
The problem is part technical, part organizational.
There are real reliability issues that must be addressed as infrastructure, supporting technologies, and best practices mature. But trust is ultimately a human emotion. Your stakeholders and your team need to see your agent perform reliably with their own eyes.
More often than not, I find that teams that have moved from cautious experimentation to cautiously optimistic deployment have defined what good looks like. They have the ability to see when their agents don’t meet their expectations and actively seek to minimize downtime.
In a world where everyone has access to powerful models, teams that can consistently prove that their agent actually works as intended will scale their way to success.