Agents on agents: Troubleshooting Agent now diagnoses agentic issues in the MC platform
AI agents break differently from the way data pipelines do.
A data pipeline fails loudly. A schema changes, a table goes missing, a freshness threshold is breached – and you know about it immediately. The fix is fairly traceable and its impact is calculable.
Agents are different. They can fail quietly, ambiguously, and often after the fact. Issues such as degrading output quality or climbing latency can creep up unnoticed for a while. And the question you’re left staring at isn’t what happened – you can usually see that in the trace – it’s why. Which prompt change caused the regression? Which code commit introduced the multi-call loop? Which model swap degraded the response quality you were counting on?
Identifying the root cause of an agentic failure can be a huge task because there is so much to consider: the data inputs feeding the models, the model’s performance, the agent’s trajectory and decision-making process, and its final outputs.
That’s the gap Monte Carlo is closing today with Troubleshooting Agent, now available for Agent Observability in private release.

Why debugging AI agents is still mostly manual
The teams building and operating AI agents in production are sophisticated. They instrument their pipelines, set up evals, monitor token consumption, and track error rates. But when something anomalous shows up in the metrics, the investigation that follows is still largely manual.
You pull up the trace and compare it against previous traces. You dig through recent PRs. You try to correlate a behavior change to a specific deployment event. It’s exactly the kind of high-context detective work that’s slow, error-prone, and impossible to scale, especially when you’re running multiple agents in production and incidents don’t wait for business hours.
Many observability tools help you visualize what happened, but few – if any – of them diagnose why it happened. That distinction matters enormously when agents are running autonomously and downstream decisions depend on them.
Introducing the Troubleshooting Agent for Agent Observability
Monte Carlo’s Agent Observability platform already gives teams the ability to monitor AI agents end to end — tracking operational metrics like token usage, latency, and error rates alongside output quality through LLM-as-a-judge evaluations. When something goes wrong, Monte Carlo fires an alert.
Today, we’re going a step further.
The Troubleshooting Agent is Monte Carlo’s automated root cause analysis engine for AI agent failures. When you get an alert or spot an errored trace, instead of starting a manual investigation, you click a single button and the agent does the analysis for you.
It traces execution paths through your LLM pipeline, compares the failing trace against historical baselines, surfaces correlating events (code changes, model version swaps, config updates, prompt diffs), and delivers a root cause hypothesis with supporting evidence and a step-by-step verification checklist.
It’s a diagnosis rather than a summary or visualization, which saves your team hours of manual investigation and protects the integrity of your agentic systems.
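To make the two core steps concrete, here is a minimal Python sketch of comparing a metric against a historical baseline and listing the events that landed shortly before a failure. It is an illustration only, not Monte Carlo’s implementation: the `Event` type, the function names, the z-score test, and the lookback window are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, pstdev


@dataclass
class Event:
    """Hypothetical change event; the fields are assumptions for this sketch."""
    kind: str          # e.g. "code_change", "model_swap", "config_update", "prompt_diff"
    description: str
    at: datetime


def deviates_from_baseline(value: float, baseline: list[float], threshold: float = 3.0) -> bool:
    """Return True if `value` is more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


def correlating_events(events: list[Event], failure_at: datetime, lookback: timedelta) -> list[Event]:
    """Return events inside the lookback window before the failure, most recent first."""
    window_start = failure_at - lookback
    candidates = [e for e in events if window_start <= e.at <= failure_at]
    return sorted(candidates, key=lambda e: e.at, reverse=True)
```

In this sketch, a trace whose latency or token count fails the baseline check would be paired with the output of `correlating_events`, giving a shortlist of candidate causes to verify. A real diagnosis would need far more context than a single threshold and a time window.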


How it works
The Troubleshooting Agent is available in three places inside Monte Carlo:
On errored traces: In the Traces view for any AI agent, a “Troubleshoot this trace” button is surfaced directly on traces with errors. Click it and the agent analyzes the execution path, identifies what diverged from normal behavior, and surfaces the most likely cause.

On errored conversations: For customer-facing agents where the failure shows up at the conversation level, not just the trace level, the same capability is available in the Conversations view. Investigate a specific user interaction that went wrong without having to manually correlate it back to an underlying trace.

On agent metric and eval alerts: When Monte Carlo fires an alert for anomalous token usage, elevated error rates, or a drop in eval scores, the Troubleshooting Agent surfaces directly on the alert detail page. It analyzes the anomaly, traces it back through recent events, and explains what changed.
Here’s a concrete example of what that looks like in practice:
An alert fires for anomalous completion token usage on a production agent. The Troubleshooting Agent’s output identifies that a recent PR, merged 7 days prior, enabled a feature flag that changed the output format of a specific task, causing average LLM calls per trace to increase from 1.0 to 1.75 and driving a P95 latency spike. It links directly to the PR, explains the mechanism, and provides specific steps to verify the hypothesis. That analysis would normally take an on-call engineer 30–60 minutes to piece together; the agent surfaces it in seconds.
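The arithmetic behind a finding like “1.0 to 1.75 LLM calls per trace” can be sketched in a few lines of Python. This is a hypothetical illustration, not the agent’s actual logic: the function name, the `(timestamp, llm_calls)` tuple shape, and the sample data are assumptions chosen to reproduce the numbers in the example.

```python
from datetime import datetime
from statistics import fmean


def calls_per_trace_shift(
    traces: list[tuple[datetime, int]], change_at: datetime
) -> tuple[float, float]:
    """Return average LLM calls per trace before and after a change event.

    `traces` holds hypothetical (timestamp, llm_call_count) pairs.
    """
    before = [calls for ts, calls in traces if ts < change_at]
    after = [calls for ts, calls in traces if ts >= change_at]
    if not before or not after:
        raise ValueError("need at least one trace on each side of the change")
    return fmean(before), fmean(after)


# Illustrative data: four single-call traces before the merge,
# then one single-call and three two-call traces after it.
merged_at = datetime(2025, 1, 8)
sample = [
    (datetime(2025, 1, 4), 1), (datetime(2025, 1, 5), 1),
    (datetime(2025, 1, 6), 1), (datetime(2025, 1, 7), 1),
    (datetime(2025, 1, 9), 1), (datetime(2025, 1, 10), 2),
    (datetime(2025, 1, 11), 2), (datetime(2025, 1, 12), 2),
]
before_avg, after_avg = calls_per_trace_shift(sample, merged_at)
print(before_avg, after_avg)  # 1.0 1.75
```

Splitting the traces at the merge time is the simplest possible comparison. It shows a shift that coincides with the change, which is why the example still ends with steps to verify the hypothesis rather than treating the correlation as proof.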
Why this is hard, and why we’re the right team to solve it
Troubleshooting AI agent failures is undoubtedly a harder problem than troubleshooting data issues. The signal is noisier, the failure modes are more varied, and the causal chain is harder to trace.
But Monte Carlo has a structural advantage: we’re the only observability platform that spans the full AI stack. We monitor the data and pipelines that feed your agents as well as the agents themselves. That means when we’re tracing the root cause of an agent failure, we can ask questions that no standalone trace viewer ever could, like whether the issue originates upstream in the data, in the model, or in the agent logic itself.
This gives our customers a fundamentally different kind of observability, one that treats data and AI as a cohesive ecosystem in which each feeds the other. And it’s what makes automated root cause analysis for agents tractable.
Private release: get early access
The Troubleshooting Agent is now available as a private release. If you’re running AI agents in production and want it enabled in your Monte Carlo instance, get in touch with us.
Full documentation on Agent Observability and how to get your agents instrumented is available at docs.getmontecarlo.com.