AI Observability | Updated October 28, 2025

What Is Agent Observability? Key Concepts, Use-Cases, & Vendors

AUTHOR | Michael Segner

The proliferation of agents is thrilling, and if we are being honest, terrifying. It’s an emerging technology with the power to make or break companies and careers.

At the same time, there is a ton of confusion surrounding new concepts such as agent observability. Is it the same as AI observability? What problem does it solve and how does it work? 


Cutting through the noise is harder than ever, as AI slop has outpaced experienced thought leadership.

I wrote this introductory guide, and co-authored the O’Reilly report “Ensuring Data + AI Reliability Through Observability,” with the goal of getting data + AI teams up to speed on the latest best practices and challenges for building reliable agents. Many of the concepts in that report are in this article.

By the end of this article you will understand:

  • How the agent observability category is defined;
  • The benefits of agent observability;
  • The critical capabilities required for achieving those benefits; and
  • Best practices from real data + AI teams.

What is an agent?

Anthropic defines an agent as, “LLMs autonomously using tools in a loop.” 

I’ll expand on that definition a bit. An agent is an AI equipped with a set of guiding principles and resources, capable of executing a multi-step chain of decisions and actions to produce a desired outcome. These resources often include access to databases, communication tools, or even other sub-agents (if you are using a multi-agent architecture).

What is an agent? A visual guide to the agent lifecycle.

For example, a customer support agent may:

  • Receive a user inquiry regarding a refund on their last purchase
  • Create and escalate a ticket
  • Access the relevant transaction history in the data warehouse
  • Access the relevant refund policy chunk in a vector database
  • Use the provided context and instructional prompt to formulate a response
  • Reply to the user 

And that would just be step one in the process! The user would reply, creating another unique response and series of actions.
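To make the “tools in a loop” idea concrete, here is a minimal sketch of an agent loop in Python. Everything here is illustrative: `call_llm` stands in for whatever model client you use, and the two tools are toy versions of the warehouse and vector database lookups described above.

```python
# A minimal, illustrative agent loop: an LLM repeatedly decides whether to call
# a tool or answer the user. `call_llm` and both tools are hypothetical
# placeholders, not a specific vendor API.

def lookup_transactions(user_id: str) -> str:
    return "2024-05-01: order #1234, $42.00"  # stand-in for a warehouse query

def search_refund_policy(query: str) -> str:
    return "Refunds are accepted within 30 days of purchase."  # stand-in for a vector DB lookup

TOOLS = {
    "lookup_transactions": lookup_transactions,
    "search_refund_policy": search_refund_policy,
}

def run_agent(user_message: str, call_llm, max_steps: int = 10) -> str:
    context = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        decision = call_llm(context, tools=list(TOOLS))  # model picks a tool or answers
        if decision["type"] == "tool_call":
            result = TOOLS[decision["name"]](decision["argument"])
            context.append({"role": "tool", "name": decision["name"], "content": result})
        else:  # the model produced a final answer for the user
            return decision["content"]
    return "Sorry, I couldn't resolve this request."
```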

What is observability?

Observability is the ability to have visibility into the inputs and outputs of a system, as well as the performance of its component parts.

An analogy I like to use is a factory that produces widgets. You can test the widgets to make sure they are within spec, but to understand why any deficiencies occurred you also need to monitor the gears that make up the assembly line (and have a process for fixing broken parts). 

Observability is visibility into the outputs as well as into the systems producing those outputs.
The broken boxes represent data products and the gears are the components in a data landscape that introduce reliability issues (data, systems, code).

There are multiple observability categories. The term was first introduced by platforms designed to help software engineers or site reliability engineers reduce the time their applications were offline. These solutions are categorized by Gartner in their Magic Quadrant for Observability Platforms.

Barr Moses and Monte Carlo introduced the data observability category in 2019, later updating the term to data + AI observability.

These platforms are designed to reduce data downtime and increase the adoption of reliable data + AI. Gartner has produced a Data Observability Market Guide and given the category a benefit rating of HIGH. Gartner also projects 70% of organizations will adopt data observability platforms by 2027, an increase from 50% in 2025.

And amidst these categories you also have agent observability. Let’s define it.

What is agent observability?

If we combine the two definitions–what is an agent and what is observability–we get the following:

Agent observability is the ability to have visibility into the performance of the inputs, outputs, and component parts of an LLM system that uses tools in a loop.

It’s a critical and fast-growing category–Gartner projects that 90% of companies with LLMs in production will adopt these solutions.

Agent observability provides visibility into the agent lifecycle.

Let’s revisit our customer support agent example to further flesh out this definition.

What was previously an opaque process with a user question, “can I get a refund?” and agent response, “Yes, you are within the 30 day return window, would you like me to email you a return label?” now might look like this:

Sample trace visualized within the Monte Carlo UI.

The above image is a visualized trace, or a record of each span (unit of work) the agent took as part of its session with a user. Many of these spans involve LLM calls. As you can see in the image below, agent observability provides visibility into the telemetry of each span including the prompt (input), completion (output), and operational metrics such as token count (cost), latency, and more.

The telemetry of an LLM call span visualized within the Monte Carlo UI.

As valuable as this visibility is, what is even more valuable is the ability to set proactive monitors on this telemetry. For example, getting alerted when the relevance of the agent output drops or the number of tokens used during a specific span starts to spike.
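For a rough sense of what such a proactive monitor could look like, here is a sketch of simple threshold checks over span telemetry. The field names (`relevance_score`, `token_count`) and the thresholds are hypothetical and would be tuned per use case.

```python
# Illustrative threshold checks over span telemetry. The field names and
# thresholds are hypothetical and would be tuned per use case.

def check_span(span: dict, relevance_floor: float = 0.7,
               token_ceiling: int = 4000) -> list[str]:
    alerts = []
    if span.get("relevance_score", 1.0) < relevance_floor:
        alerts.append(f"Relevance dropped to {span['relevance_score']:.2f}")
    if span.get("token_count", 0) > token_ceiling:
        alerts.append(f"Token usage spiked to {span['token_count']}")
    return alerts

# Example: check_span({"relevance_score": 0.55, "token_count": 6200})
# -> ["Relevance dropped to 0.55", "Token usage spiked to 6200"]
```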

We’ll dive into more details on common features, how it works, and best practices in subsequent sections, but first let’s make sure we understand the benefits and goals of agent observability.

A quick note on synonymous categories

Terms like GenAI observability, AI observability, or LLM observability are often used interchangeably, although technically the LLM is just one component of an agent. 

RAG (retrieval augmented generation) observability refers to a similar but narrower pattern involving AI retrieving context to inform its response. I’ve also seen teams reference LLMOps, AgentOps, or evaluation platforms.

The labels and technologies have evolved rapidly over a short period of time, but these categorical terms can be considered roughly synonymous. For example, Gartner has produced an “Innovation Insight: LLM Observability” report with essentially the same definition.

Honestly, there is no need to sweat the semantics. Whatever you or your team decide to call it, what’s truly important is that you have the technology and processes in place to monitor and improve the quality and reliability of your agent’s outputs.

Do you need agent observability if you use guardrails?

The short answer is yes. 

Many AI development platforms, such as AWS Bedrock, include real-time safeguards against toxic responses called guardrails. However, guardrails aren’t designed to catch regressions in agent responses over time across dimensions like accuracy, helpfulness, or relevance.

In practice, you need both working together. Guardrails protect you from acute risks in real time, while observability protects you from chronic risks that appear gradually. It’s similar to the relationship between data testing and anomaly detection for monitoring data quality.

Problem to be solved and business benefits

Ultimately, the goal of any observability solution is to minimize downtime.

This concept for software applications was popularized by the Google Site Reliability Engineering book, which defined downtime as the number of unsuccessful requests divided by the total number of requests.
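Expressed as a quick sketch, that definition is just a ratio, with “unsuccessful” left for your use case to define:

```python
def downtime_rate(unsuccessful_requests: int, total_requests: int) -> float:
    # Downtime per the SRE framing: unsuccessful requests over total requests.
    return unsuccessful_requests / total_requests if total_requests else 0.0

# Example: 37 unsuccessful responses out of 1,000 requests -> 0.037,
# i.e. roughly 96.3% "uptime" for the agent.
```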

Like everything in the AI space, defining a successful request is more difficult than it seems. After all, these are non-deterministic systems, meaning you can provide the same input many times and get many different outputs.

Is a request only unsuccessful if it technically fails? What about if it hallucinates and provides inaccurate information? What if the information is technically correct, but it’s in another language or surrounded by toxic language?

Again, it’s best to avoid getting lost in semantics and pedantry. Ultimately, the goal of reducing downtime is to ensure features are adopted and provide the intended value to users.

This means agent downtime should be measured based on the underlying use case. For example, clarity and tone of voice might be paramount for our customer success chatbot, but it might not be a large factor for a revenue operations agent providing summarized insights from sales calls. 

This also means your downtime metric should correspond to user adoption. If those numbers don’t track, you haven’t captured the key metrics that make your agent valuable.

Most data + AI teams I talk to today are using adoption as the main proxy for agent reliability. As the space begins to mature, teams are gradually moving toward leading indicators such as downtime and the metrics that roll up to it, such as relevancy, latency, recall (F1), and more.

Dropbox, for example, measures agent downtime as:

  • Responses without a citation
  • If more than 95% of responses have a latency greater than 5 seconds
  • If the agent does not reference the right source at least 85% of the time (F1 > 85%)
  • Factual accuracy, clarity, and formatting are other dimensions, but a failure threshold isn’t provided.

At Monte Carlo, our development team considers our Troubleshooting Agent to be experiencing downtime based on the metrics of semantic distance, groundedness, and proper tool usage. These are evaluated on a 0-1 scale using an LLM-as-judge methodology. Downtime in staging is defined as any of the following (a rough sketch of this kind of check follows the list):

  • Any score under 0.5
  • More than 33% of LLM-as-judge evaluations, or more than 2 total evaluations, score between 0.5 and 0.8 even after an automatic retry
  • Groundedness tests show the agent invents information or answers out of scope (hallucination or missing context)
  • The agent misuses or fails to call required tools
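Here is that sketch: a small function that flags downtime from a batch of 0-1 LLM-as-judge scores. It mirrors the score thresholds above but is purely illustrative (retry handling and the groundedness and tool-usage checks themselves are omitted), not Monte Carlo’s actual implementation.

```python
# Illustrative downtime check over 0-1 LLM-as-judge scores (semantic distance,
# groundedness, proper tool usage). Mirrors the thresholds described above;
# retries and the groundedness/tool-usage checks themselves are omitted.

def is_downtime(scores: list[float]) -> bool:
    if not scores:
        return False
    if any(score < 0.5 for score in scores):
        return True  # any hard failure counts as downtime
    borderline = [score for score in scores if 0.5 <= score < 0.8]
    # Too many borderline evaluations, by count or by share, also counts.
    return len(borderline) > 2 or len(borderline) / len(scores) > 0.33

# Examples: is_downtime([0.9, 0.92, 0.85]) -> False
#           is_downtime([0.9, 0.6, 0.65, 0.7]) -> True
```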

Outside of adoption, agents can be evaluated across the classic business values of reducing cost, increasing revenue, or decreasing risk. In these scenarios, the cost of downtime can be quantified easily by taking the frequency and duration of downtime and multiplying it by the ROI being driven by the agent. 

This formula remains mostly academic at the moment since, as we’ve noted previously, most teams are not as focused on measuring immediate ROI.

However, I have spoken to a few. One of the clearest examples in this regard is a pharmaceutical company using an agent to enrich customer records in a master data management match merge process. 

They originally built their business case on reducing cost, specifically the number of records that need to be enriched by human stewards. However, while they did increase the number of records that could be automatically enriched, they also improved a large number of poor records that would otherwise have been automatically discarded!

So the human steward workload actually increased! Ultimately, this was a good result as record quality improved; however, it does underscore how fluid and unpredictable this space remains.

How agent observability works

Agent observability can be built internally by engineering teams or purchased from several vendors (including Monte Carlo). We’ll save the build vs buy analysis for another time, but similar to data testing, some smaller teams will choose to start with an internal build until they reach a scale where a more systemic approach is required.

Whether an internal build or vendor platform, when you boil it down to the essentials, there are really two core components to an agent observability platform: trace visualization and evaluation monitors.

Trace visualization

Traces, or the telemetry data that describes each step taken by an agent, can be captured using an open source SDK that leverages the OpenTelemetry (Otel) framework. Monte Carlo offers one such SDK that can be freely leveraged, with no vendor lock-in.

Teams label key steps, like skills, workflows, or tool calls, as spans. When a session starts, the agent calls the SDK, which captures all the associated telemetry for each span, such as model version, duration, tokens, etc.

A collector then sends that data to the intended destination (we think the best practice is to consolidate within your warehouse or lakehouse source of truth), where an application can help visualize the information, making it easier to explore.

Visualized trace within an agent observability platform.
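For a flavor of what that instrumentation looks like, here is a minimal sketch using the vanilla open source OpenTelemetry Python SDK (not Monte Carlo’s specific SDK). The span name and attribute keys are illustrative, and a real setup would export spans to a collector rather than the console.

```python
# Minimal OpenTelemetry (OTel) instrumentation sketch in Python. In a real
# setup the spans would be exported to a collector and on to your warehouse or
# lakehouse; here they are simply printed. Attribute keys are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("customer_support_agent")

def answer_refund_question(question: str) -> str:
    # Label the key step (a span) and attach the telemetry you care about.
    with tracer.start_as_current_span("generate_refund_response") as span:
        span.set_attribute("llm.model_version", "example-model-v1")
        span.set_attribute("llm.prompt", question)
        completion = "Yes, you are within the 30 day return window."  # stand-in for the LLM call
        span.set_attribute("llm.completion", completion)
        span.set_attribute("llm.token_count", 112)
        return completion
```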

One benefit to observing agent architectures is that this telemetry is relatively consolidated and easy to access via LLM orchestration frameworks as compared to observing data architectures where critical metadata may be spread across a half dozen systems.

Evaluation monitors

Once you have all of this rich telemetry in place, you can monitor or evaluate it. This can be done using an agent observability platform, or sometimes the native capabilities within data + AI platforms.

Teams will typically refer to the process of using AI to monitor AI (LLM-as-judge) as an evaluation. This type of monitor is well-suited to evaluate the helpfulness, validity, and accuracy of the agent. This is because the outputs are typically larger text fields and non-deterministic, making traditional SQL based monitors less effective across these dimensions.
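For illustration, here is a rough sketch of what an LLM-as-judge evaluation can look like: a second model grades an agent response for helpfulness on a 0-1 scale. The `judge_llm` callable is a hypothetical stand-in for whatever model client you use, and the prompt and scale are examples rather than a standard.

```python
# Illustrative LLM-as-judge evaluation: a second model scores an agent response
# for helpfulness on a 0-1 scale. `judge_llm` is a hypothetical stand-in for
# your model client; the prompt and scale are examples, not a standard.

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness from 0.0 (useless) to 1.0 (fully resolves the
question). Respond with only the number."""

def evaluate_helpfulness(question: str, answer: str, judge_llm) -> float:
    raw = judge_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return max(0.0, min(1.0, float(raw.strip())))  # clamp to the 0-1 scale
    except ValueError:
        return 0.0  # treat unparseable judge output as a failed evaluation
```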


Where SQL and code-based monitors really shine, however, is in detecting issues across operational metrics (system failures, latency, cost, throughput), as well as situations in which the agent’s output must conform to a very specific format or rule–for example, if the output must be in the format of a US postal address or if it must always have a citation.

Most teams will require both types of monitors. In cases where either approach will produce a valid result, teams should favor code-based monitors as they are more deterministic, explainable, and cost-effective.

However, it’s important to ensure your heuristic or code-based monitor is achieving the intended result. Simple code-based monitors focused on use case specific criteria–say, output length must be under 350 characters–are typically more effective than complex formulas designed to broadly capture semantic accuracy or validity, such as ROUGE, BLEU, cosine similarity, and others.

While these traditional metrics benefit from being explainable, they struggle when the same idea is expressed in different terms. Almost every data science team starts with these familiar monitors, before quickly abandoning them after a rash of false positives.
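As a sketch of those simple, use-case-specific checks, the functions below enforce a length limit and a citation requirement. The 350-character cap and the bracketed citation pattern are hypothetical examples tied to one use case, not a general rule.

```python
import re

# Illustrative deterministic (code-based) monitors on agent output: a length
# cap and a citation requirement. The 350-character limit and the citation
# pattern are hypothetical examples tied to one use case.

def output_within_length(output: str, max_chars: int = 350) -> bool:
    return len(output) <= max_chars

def output_has_citation(output: str) -> bool:
    # Expects bracketed citations like "[1]" or "[source: refund_policy]".
    return re.search(r"\[[^\]]+\]", output) is not None

def run_code_monitors(output: str) -> dict[str, bool]:
    return {
        "length_ok": output_within_length(output),
        "citation_present": output_has_citation(output),
    }
```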

What about context engineering and reference data?

This is arguably the third component of agent observability. 

It can be a bit tricky to draw a firm line between data observability and agent observability. In fact, at Monte Carlo we aren’t even trying. We now refer to the data observability category we created as data + AI observability.

This is because agent behavior is driven by the data it retrieves, summarizes, or reasons over. In many cases, the “inputs” that shape an agent’s responses — things like vector embeddings, retrieval pipelines, and structured lookup tables — sit somewhere between the two worlds. Or perhaps it may be more accurate to say they all live in one world, and that agent observability MUST include data observability.

This argument is pretty sound. After all, an agent can’t get the right answer if it’s fed wrong or incomplete context–and in these scenarios agent observability evaluations will still pass with flying colors. 

For those who want to take a deeper, technical dive into observing both the data and AI landscape check out our O’Reilly report “Ensuring Data + AI Reliability Through Observability” for more detail.

Challenges and best practices

It would be easy enough to generate a list of agent observability challenges teams could struggle with, but let’s take a look at the most common problems teams are actually encountering. And remember, these are challenges specifically related to observing agents.

Challenge #1- Evaluation cost

LLM workloads aren’t cheap, and a single agent session can involve hundreds of LLM calls. Now imagine for each of those calls you are also calling another LLM multiple times to judge different quality dimensions. It can add up quickly.

One data + AI leader confessed to us their evaluation cost was 10 times as expensive as the baseline agent workload. Monte Carlo’s agent development team strives to maintain roughly a one-to-one workload-to-evaluation ratio.

Best practices to contain evaluation cost

Most teams will sample a percentage or aggregate number of spans per trace to manage costs while still retaining the ability to detect degradations in performance. Stratified sampling, or sampling a representative portion of the data, can be helpful in this regard. Conversely, it can also be helpful to filter for specific spans, such as those with a longer than average duration.

Agent span sampling and filtering
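Here is a rough sketch of the sampling and filtering idea: always evaluate unusually slow spans, and randomly sample a fraction of the rest. The sample rate and latency cutoff are arbitrary examples, and a true stratified approach would sample within groups (per tool, per workflow) rather than uniformly.

```python
import random

# Illustrative span selection to contain evaluation cost: always evaluate
# unusually slow spans, and randomly sample a fraction of everything else.
# The sample rate and latency cutoff are arbitrary examples.

def select_spans_for_evaluation(spans: list[dict], sample_rate: float = 0.10,
                                slow_threshold_ms: float = 2000.0) -> list[dict]:
    selected = []
    for span in spans:
        if span.get("duration_ms", 0) > slow_threshold_ms:
            selected.append(span)  # longer-than-average spans are always kept
        elif random.random() < sample_rate:
            selected.append(span)  # plus a random sample of the rest
    return selected
```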

Challenge #2- Defining failure and alert conditions

Even when teams have all the right telemetry and evaluation infrastructure in place, deciding what actually constitutes “failure” turns out to be surprisingly difficult.

To start, defining failure requires being deeply familiar with the agent’s use case and user expectations. A customer support bot, a sales assistant, and a research summarizer all have different standards for what counts as “good enough.” 

What’s more, the relationship between a bad response and its real-world impact on adoption isn’t always linear or obvious. For example, if an evaluation model gives a response a 0.75 for clarity, is that a failure?

Best practices for defining failure and alert conditions

Aggregate multiple evaluation dimensions. Rather than declaring a failure based on a single score, combine several key metrics — such as helpfulness, accuracy, faithfulness, and clarity — and treat them as a composite pass/fail test. This is the approach Monte Carlo takes in our agent evaluation framework for our internal agents. 

Most teams will also leverage anomaly detection to identify a consistent drop in scores over a period of time rather than a single (possibly hallucinated) evaluation. Dropbox, for example, leverages dashboards that track their evaluation score trends over hourly, six-hour, and daily intervals.

Finally, know which monitors are “soft” and which are “hard.” Some monitors should immediately trigger an alert condition when their threshold is breached. Typically these are more deterministic monitors evaluating an operational metric such as latency or a system failure.
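A simple version of the “consistent drop over time” idea is a rolling-window comparison like the sketch below. The window sizes and drop threshold are arbitrary examples, not recommendations.

```python
from statistics import mean

# Illustrative rolling-window check: alert only when the recent average
# evaluation score drops well below a longer-run baseline, rather than on any
# single low score. Window sizes and the drop threshold are arbitrary examples.

def sustained_drop(scores: list[float], recent_n: int = 20,
                   baseline_n: int = 200, max_drop: float = 0.10) -> bool:
    if len(scores) < recent_n + baseline_n:
        return False  # not enough history to judge a trend yet
    baseline = mean(scores[-(recent_n + baseline_n):-recent_n])
    recent = mean(scores[-recent_n:])
    return (baseline - recent) > max_drop
```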

Challenge #3- Flaky evaluations

Who evaluates the evaluators? Using a system that can hallucinate to monitor a system that can hallucinate has obvious drawbacks. 

The other challenge for creating valid evaluations is that, as every single person who has put an agent into production has bemoaned to me, small changes to the prompt have a large impact on the outcome. This means creating customized evaluations or experimenting with evaluations can be difficult.

Best practices for avoiding flaky evaluations:

Most teams avoid flaky tests or evaluations by testing extensively in staging on golden datasets with known input-output pairs. This will typically include representative queries that have proved problematic in the past.

It is also a common practice to test evaluations in production on a small sample of real-world traces with a human in the loop. 

Of course, LLM judges will still occasionally hallucinate. Or as one data scientist put it to me, “one in every ten tests spits out absolute garbage.” He automatically reruns evaluations that return low scores to confirm issues.
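That retry pattern can be as simple as the sketch below: rerun the judge when the first score is low, and only treat the result as a real issue if the low score reproduces. The `evaluate` callable and thresholds are hypothetical.

```python
# Illustrative guard against flaky LLM-as-judge results: rerun low-scoring
# evaluations and only treat them as failures if the low score reproduces.
# `evaluate` is a hypothetical callable that returns a 0-1 score for a trace.

def confirmed_score(evaluate, trace_id: str, low_threshold: float = 0.5,
                    retries: int = 2) -> float:
    score = evaluate(trace_id)
    attempts = 0
    while score < low_threshold and attempts < retries:
        score = max(score, evaluate(trace_id))  # keep the best score seen so far
        attempts += 1
    return score  # still low after reruns -> likely a real issue, not judge noise
```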

Challenge #4- Visibility across the data + AI lifecycle

Of course, once a monitor sends an alert, the immediate next question is always: “why did that fail?” Getting the answer isn’t easy! Agents are highly complex, interdependent systems.

Finding the root cause requires end-to-end visibility across the four components that introduce reliability issues into a data + AI system: data, systems, code, and model. Here are some examples:

Data

  • Real-world changes and input drift. For example, if a company enters a new market and now there are more users speaking Spanish than English–a mismatch with the language the model was primarily trained on.
  • Unavailable context. We recently wrote about an issue where the model was working as intended but the context on the root cause (in this case a list of recent pull requests made on table queries) was missing.

System

  • Pipeline or job failures
  • Any change to what tools are provided to the agent or changes in the tools themselves. 
  • Changes to how the agents are orchestrated 

Code

  • Data transformation issues (changing queries, transformation models)
  • Updates to prompts
  • Changes impacting how the output is formatted

Model

  • The platform updates its model version
  • Changes to which model is used for a specific call

Best practices for visibility across the data + AI lifecycle

It is critical to consolidate telemetry from your data + AI systems into a single source of truth, and many teams are choosing the warehouse or lakehouse as their central platform. This unified view lets teams correlate failures across domains — for example, seeing that a model’s relevancy drop coincided with a schema change in an upstream dataset or an updated model. Monte Carlo’s own approach is to consolidate traces, evaluations, and metadata in one place to make cross-component debugging faster and easier.

Representative vendors

  • Monte Carlo
  • Arize
  • Fiddler
  • Braintrust
  • New Relic
  • Datadog

It’s interesting to see how the leading agent observability vendors all started by observing other technologies. For example, Monte Carlo created the data observability category, while Arize and Fiddler got their start observing machine learning models, and New Relic and Datadog focused on software applications.

Deep dive: example architecture

The image above shows the technical architecture Monte Carlo’s Troubleshooting Agent leverages to build a scalable, secure, and decoupled system that connects Monte Carlo’s existing monolithic platform with its new AI Agent stack.

On the AI side, the AI Agent Service runs on Amazon ECS Fargate, which enables containerized microservices to scale automatically without managing underlying infrastructure. Incoming traffic to the AI Agent Service is distributed through a network load balancer (NLB), providing high-performance, low-latency routing across Fargate tasks.

The image below is an abstracted interpretation of the Troubleshooting Agent’s workflow, which leverages several specialized sub-agents. These sub-agents investigate different signals to determine the root cause of a data quality incident and report back to the managing agent, which presents the findings to the user.

An abstracted interpretation of the Troubleshooting Agent’s workflow, which leverages several specialized sub-agents.

Deliver production-ready agents

I hope you’ve found this guide useful. If I accomplished my goal of providing insightful and trusted content based on real-life experiences, please feel free to give me a shout on LinkedIn and tell me what you found most helpful or what other questions I can cover.

The core takeaway I hope you walk away with is this: when your agents enter production and become integral to business operations, the ability to evaluate their reliability becomes a matter of necessity. Production-grade agents must be observed.

If you are interested in learning more about Monte Carlo’s agent observability solution, we’d love to talk with you.

Our promise: we will show you the product.