Data Observability Updated Apr 29 2026

Monitoring Cortex Agent Performance With Trace Data

AUTHORS | Michael Segner | Virna Sekuj

Observing your Cortex Agent fleet

Teams building with Snowflake Intelligence tend to move fast. Cortex Agents can be built quickly and seamlessly, and before you know it, you have a fleet running in production across different domains and business functions.

Snowflake Intelligence architecture. Source: Snowflake.

Deploying agents is an empowering experience. Understanding and evaluating their performance, however, can be a different story. Monitoring health signals such as error rates, latency, and token consumption, along with quality indicators such as freshness and volume, is a baseline requirement for any production system; agents are no exception.

Snowflake Intelligence provides the foundation teams need to answer these important questions. It logs rich trace data natively — conversation history, tool execution, LLM planning, response generation.

The challenge is surfacing it in a way that’s operationally useful without writing SQL against observability tables every time.

Think of it as a maturation curve.

Initially, you might start with just getting moment-in-time snapshots by querying agent error rates or other performance metrics. As you advance, you start building and maintaining dashboards, and then monitoring those dashboard trends with anomaly detection models to identify outliers and SLA breaches. Eventually, you can actively identify systemic issues driving “edge cases” and implement an incident management process to maintain production-grade SLAs.

Graduating through this maturation curve requires a strong understanding of agent telemetry and how it can be observed, analyzed, and ultimately used to troubleshoot agentic failures.

We will explore these topics in this post, walking through how Snowflake logs Cortex Agent telemetry as well as how agent observability solutions can help monitor and fix common performance failure patterns. Other critical aspects of agent reliability including context (data quality), trajectories (tool calls), and outputs are for a later day.

What Snowflake logs for Cortex Agents: a look under the hood

Before setting up any monitors, it’s worth understanding the trace structure Snowflake Intelligence produces.

Every Cortex Agent interaction is composed of a hierarchy of spans, each representing a different phase of the agent’s operation. Snowflake stores this trace data in native observability tables, accessible via the table function SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_EVENTS, which can be called directly in a Snowflake SQL worksheet.

For example, the query that Monte Carlo runs under the hood to power our agent performance monitors looks like this. It’s worth taking a closer look, as it reveals exactly what Snowflake is logging at the span level:

SELECT *
FROM TABLE(
    SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_EVENTS(
        'YOUR_DATABASE',
        'YOUR_SCHEMA',
        'YOUR_AGENT_NAME',
        'CORTEX AGENT'
    )
)
WHERE RECORD_TYPE = 'SPAN'
  AND RECORD:name::STRING NOT LIKE 'SqlExecution_%';

Running the GET_AI_OBSERVABILITY_EVENTS query on the “MSegner_Demo_Agent” in Snowflake.

The SqlExecution_* spans are filtered out deliberately because they represent the actual SQL runs triggered by Cortex Analyst and tend to produce noise if included in agent-level metrics. When you take them out, the agent behavioral trace remains.

Mapping records to span types

Each span has a record_name that identifies its span type.

This metadata is critical to understanding what each record_name actually represents in terms of agent behavior, and it’s what makes the trace data actionable. Without this mapping, you’re looking at a flat list of cryptic strings with no clear signal about where in the agent’s operation something went wrong.

Here’s how they map to agent operations:

CASE
    WHEN record_name = 'AgentV2RequestResponseInfo' THEN 'chat'
    WHEN record_name LIKE 'ReasoningAgentStepPlanning-%' THEN 'planning'
    WHEN record_name LIKE 'ReasoningAgentStepResponseGeneration-%' THEN 'response_generation'
    WHEN record_name LIKE 'CortexSearchService_%' THEN 'tool_call'
    WHEN record_name LIKE 'CortexAnalystTool_%' THEN 'tool_call'
    ELSE 'unknown'
END AS request_type

Agent telemetry, record_name, in Snowflake.
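The same mapping can be applied outside SQL, for instance on rows exported from the query above. The following is a minimal Python sketch, assuming each row exposes the span name as a plain string; the names classify_span and PREFIX_TO_TYPE are ours for illustration and not part of any Snowflake API.

```python
# Prefix rules mirroring the CASE expression above.
# The prefixes are disjoint, so the order of the entries does not matter.
PREFIX_TO_TYPE = {
    "ReasoningAgentStepPlanning-": "planning",
    "ReasoningAgentStepResponseGeneration-": "response_generation",
    "CortexSearchService_": "tool_call",
    "CortexAnalystTool_": "tool_call",
}


def classify_span(record_name: str) -> str:
    """Map a span's record_name to a coarse request type."""
    if record_name == "AgentV2RequestResponseInfo":
        return "chat"
    for prefix, request_type in PREFIX_TO_TYPE.items():
        if record_name.startswith(prefix):
            return request_type
    return "unknown"
```

One difference worth knowing: in SQL LIKE patterns, `_` matches any single character, so the SQL version is slightly more permissive than the literal prefixes used in this sketch.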

What each span type represents:

- chat : The top-level span for a complete conversation turn. AgentV2RequestResponseInfo is the outermost envelope: it captures the full input and output of a single user interaction, plus the thread_id that ties multi-turn conversations together.

- planning : ReasoningAgentStepPlanning spans capture the agent’s decision-making: which tool to call next, what query to formulate, what context to pass. Each planning step has its own token count, model name, and tool selection logged in RECORD_ATTRIBUTES.

- response_generation : ReasoningAgentStepResponseGeneration spans capture the final answer synthesis. The agent has finished calling tools and is now producing its response. Token counts here reflect the cost of the final LLM call.

- tool_call : Two tool types are logged, each with its own attribute schema:

  • CortexSearchService_* spans log the search query, columns requested, filters applied, result limit, returned results, and status.
  • CortexAnalystTool_* spans log the messages passed in, the semantic model used, the SQL query generated, the text response, and — notably — a question_category field that classifies the type of question Cortex Analyst received.

When it comes to the question_category field, Cortex Analyst categorizes questions automatically, and that classification is observable in the trace data. Categories can include things like simple lookups, aggregations, time-series queries, and comparative analysis.

If you’re seeing token spikes or latency increases on a specific agent, filtering traces by question_category can quickly show whether the pattern is concentrated in a particular question type.

More complex analytical questions tend to generate significantly more elaborate SQL and longer responses than simple lookups, so a shift in the mix of question types hitting an agent can move aggregate metrics even if nothing in the agent configuration changed.

Key performance metrics to monitor

Now that we have a good understanding of span types, let’s dive into the key performance metrics teams should be monitoring: token count, status codes, and duration. Abnormal behavior almost always shows up in these metrics before anywhere else.

Total tokens

Token consumption, and therefore cost, reflects how much context the agent is processing and generating across its spans: the more work the agent does, the more tokens it consumes. A sustained increase in mean tokens typically indicates one of the following:

  • Input length changes: users are sending longer queries, or the calling application is passing more context with each request.
  • Context window accumulation: in multi-turn conversations, Cortex Agents maintain thread history via thread_id. Each subsequent turn carries the full prior context, so a session that starts at 5k tokens can exceed 40k by turn 10. If mean tokens are climbing gradually, multi-turn context growth is often the cause.
  • Retrieval behavior changes: if the agent uses Cortex Search, a change in retrieval configuration (more results returned, larger chunks, different reranking) shows up as a token increase without any change in user behavior.
  • Increased tool call depth: more tool calls per conversation means more planning spans, each with their own token cost. Changes to tool routing logic become clear here.
  • Output verbosity drift: less common, but changes to the underlying Cortex model can affect how verbose responses are. The model_name field on planning spans makes it possible to correlate token changes with model versions.
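To make the multi-turn arithmetic concrete, here is a small sketch of context accumulation. The 5k starting prompt comes from the example in the list above; the 4k tokens added per turn is an assumed figure chosen purely for illustration, not a measured Snowflake value, and both function names are hypothetical.

```python
def prompt_tokens_at_turn(turn: int, first_turn_tokens: int = 5_000,
                          added_per_turn: int = 4_000) -> int:
    """Prompt size at a given turn when every turn carries the full prior history.

    added_per_turn is a hypothetical average of new user input plus agent
    output appended to the thread on each turn.
    """
    if turn < 1:
        raise ValueError("turn is 1-indexed")
    return first_turn_tokens + added_per_turn * (turn - 1)


def cumulative_prompt_tokens(turns: int, **kwargs) -> int:
    """Total prompt tokens processed across all turns of one session."""
    return sum(prompt_tokens_at_turn(t, **kwargs) for t in range(1, turns + 1))
```

Under these assumptions, turn 10 carries 41,000 prompt tokens and the ten-turn session processes 230,000 in total: the per-turn prompt grows linearly, but the session total grows quadratically with the number of turns.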

In Snowflake, token accounting happens at the span level, not the conversation level. Snowflake logs prompt tokens and completion tokens separately within RECORD_ATTRIBUTES for each span, meaning that a single conversation generates multiple token counts across its planning, tool call, and response generation spans combined.

Token_count for the ReasoningAgentStepPlanning-0 span shown in Snowflake.

A best practice, whether you are using an agent observability tool or a more manual process, is to resolve these into a unified total_tokens metric per span and to look at the mean across all spans for that day, not per conversation. Keep this in mind when interpreting token spikes: the metric tracks the cost of a typical span, not of a typical conversation.
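As a rough sketch of that calculation, the snippet below sums prompt and completion tokens per span and then averages across all spans in a day. It assumes the spans have already been flattened into dictionaries; the keys day, prompt_tokens, and completion_tokens are illustrative names and not the actual attribute keys inside RECORD_ATTRIBUTES.

```python
from collections import defaultdict
from statistics import mean


def total_tokens(span: dict) -> int:
    """Unify prompt and completion counts; missing or null counts are treated as 0."""
    return (span.get("prompt_tokens") or 0) + (span.get("completion_tokens") or 0)


def daily_mean_tokens(spans: list[dict]) -> dict[str, float]:
    """Mean total_tokens across all spans for each day (not per conversation)."""
    by_day = defaultdict(list)
    for span in spans:
        by_day[span["day"]].append(total_tokens(span))
    return {day: mean(values) for day, values in sorted(by_day.items())}
```

Spans with no token fields count as zero here, which pulls the daily mean down; depending on your use case you may prefer to exclude them instead.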

Duration

Duration tends to correlate with token consumption; more tokens generally means longer processing.

However, duration can also spike independently when the agent is making more tool calls than expected, hitting slow underlying data, or retrying failed operations. Watching both together gives you more diagnostic signal than either alone.

Latency distributions like P50 and P90 also give you a more honest picture than means alone: a stable mean can mask a growing tail of slow responses that’s quietly eroding user trust.
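The point about means hiding the tail can be shown with a few lines of Python. This sketch uses the nearest-rank definition of a percentile (one of several common definitions), and the durations are made-up values in seconds.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Two hypothetical days of span durations (seconds) with the same mean.
steady_day = [2.0] * 10
tail_day = [1.0] * 8 + [6.0] * 2

for name, durations in (("steady", steady_day), ("tail", tail_day)):
    print(name, mean(durations), percentile(durations, 50), percentile(durations, 90))
```

Both days have a mean of 2.0 seconds, but the second day's P90 is 6.0 seconds while its P50 is only 1.0: a mean-only monitor would treat the two days as identical.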

Neither token consumption nor duration tells you what went wrong (that requires human or agentic troubleshooting), but both tell you that something changed. That is powerful, because change is often hard to detect in agentic systems, which tend to fail gracefully rather than crash.

Status codes

Status codes are also tracked per span directly in Snowflake’s trace data, with each span resolving to either STATUS_CODE_OK or STATUS_CODE_ERROR.

Errors, therefore, are observable at the span level, meaning you can distinguish between a planning step failure, a tool call failure, and a response generation failure. These are three very different root causes with different remediation paths, and so having that obvious distinction is crucial for troubleshooting agentic failures.

Derived metrics such as completion rate, the proportion of spans that resolve with STATUS_CODE_OK rather than STATUS_CODE_ERROR, can catch silent failures that duration and token counts miss entirely.
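A minimal sketch of completion rate broken out by span type follows. It assumes spans have been flattened into dictionaries with a request_type (from the record_name mapping earlier) and a status string; those key names are illustrative, not Snowflake's.

```python
from collections import Counter

STATUS_OK = "STATUS_CODE_OK"


def completion_rate_by_type(spans: list[dict]) -> dict[str, float]:
    """Share of spans resolving with STATUS_CODE_OK, per request type."""
    totals, oks = Counter(), Counter()
    for span in spans:
        totals[span["request_type"]] += 1
        if span["status"] == STATUS_OK:
            oks[span["request_type"]] += 1
    return {request_type: oks[request_type] / totals[request_type]
            for request_type in totals}
```

Splitting the rate by request type is what lets a failing tool call stand out even when planning and response generation spans are all healthy.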

Combining signals

Once you’ve established baseline metrics and have a feel for normal agent performance, the monitoring surface area opens up considerably, and this is where agent observability starts to feel meaningfully different from traditional data pipeline monitoring.

For example, evaluation monitors watch for meaningful regressions in the fitness of the final agent output. Was it helpful? Did it complete the task? Was it in the right language or tone of voice?

The real power comes from leveraging agent observability solutions to combine these signals across span types. High token counts on planning spans alongside low completion rates on tool call spans point to a very different problem than stable performance metrics with declining evaluation scores.

An agent monitor set to alert on duration OR token anomalies grouped by day shown in Monte Carlo.

Span-level visibility across multiple signal types simultaneously is what makes the difference between knowing something is wrong and knowing where to look.

Common agent performance issues

Token spikes: what they look like and what causes them

A common pattern in the early weeks of a Cortex Agent deployment is a step-change increase in mean token consumption, with mean tokens roughly doubling within a few days before partially correcting.

An initial token spike in the early life of a Cortex agent shown in Monte Carlo.

This kind of pattern has distinct possible causes worth working through systematically:

Sudden spike then partial correction: This suggests a temporary change that partially resolved, with common causes being a specific high-token user session that inflated the daily mean, a temporary change in system prompt or tool configuration that was reverted, or a burst of multi-turn conversations that accumulated context before users abandoned them.

Step-change up with no correction: This suggests a persistent change. Likely causes can be a system prompt change, a new version of the calling application passing more context, or a retrieval configuration change that permanently increased result set sizes.

Gradual drift upward: This is the most concerning pattern for multi-turn agents. If thread history is accumulating without truncation, mean tokens will climb steadily as users have longer conversations. This is manageable but requires explicit context window management in the agent configuration.

For any of these patterns emerging with your Cortex Agent, you can troubleshoot them manually or using an agent observability solution. The next step here would be to view the traces, filtering to the affected date range and inspecting individual spans to compare input lengths, planning step counts, and tool call behavior before and after the spike. Some agent observability tools provide agentic root cause analysis to accelerate this step.

The question_category field from Cortex Analyst spans can be particularly useful here: if the spike coincides with a shift in question types (e.g., more complex analytical queries), that’s a different problem than if question types stayed constant.
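One way to check for such a shift is to compare the share of each question category before and after the spike. The sketch below assumes Cortex Analyst spans have been exported as dictionaries with a question_category key; the function names and the category labels used in the usage example are hypothetical placeholders, not the actual values Snowflake emits.

```python
from collections import Counter


def category_shares(spans: list[dict]) -> dict[str, float]:
    """Fraction of spans falling in each question_category."""
    counts = Counter(span["question_category"] for span in spans)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}


def mix_shift(before: list[dict], after: list[dict]) -> dict[str, float]:
    """Change in each category's share (after minus before)."""
    b, a = category_shares(before), category_shares(after)
    return {category: a.get(category, 0.0) - b.get(category, 0.0)
            for category in sorted(set(a) | set(b))}


# Hypothetical usage: 80/20 lookups vs. comparisons before, 50/50 after.
before = [{"question_category": "lookup"}] * 8 + [{"question_category": "comparative"}] * 2
after = [{"question_category": "lookup"}] * 5 + [{"question_category": "comparative"}] * 5
shift = mix_shift(before, after)
```

In this made-up example the comparative share rises by roughly 30 points, which would suggest the spike reflects a change in what users are asking rather than a change in the agent itself.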

Usage volatility

The second pattern that appears consistently in early Cortex Agent deployments is high variance in daily usage volume.

High token variance shown in Monte Carlo.

Enterprise internal tools and B2B workflows, where many early Cortex Agents are deployed, tend to see spiky rather than uniform usage. Traffic concentrates around business hours, specific workflows, and particular user cohorts, so days with low traffic can produce near-zero row counts in the agent observability tables.

This has two practical implications for monitoring:

Baseline quality is uneven. Anomaly detection models trained on high-variance data produce wider confidence intervals, meaning they require larger deviations to trigger alerts. Days with near-zero traffic contribute almost no signal to the baseline, and means calculated on small sample sizes are unreliable. Teams with highly variable traffic patterns may need to tune monitor sensitivity themselves or consider filtering to business-hours windows to get cleaner baselines.

Low-traffic days can mask failures. If an agent is experiencing elevated error rates but daily volume is low, the aggregate metrics may look normal simply because there aren’t enough spans to move the mean. This is the argument for adding error rate monitoring as a separate signal, tracking status_code = 2 (STATUS_CODE_ERROR) as a proportion of total spans rather than relying solely on performance metrics to surface problems.
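A sketch of that separate error-rate signal is shown below. It computes the daily proportion of spans with status_code = 2 and flags days whose span count is too small for the rate to be trusted. The dictionary keys and the min_spans threshold of 30 are assumptions for illustration; pick a threshold that fits your own traffic.

```python
STATUS_CODE_ERROR = 2


def daily_error_rates(spans: list[dict], min_spans: int = 30) -> dict[str, dict]:
    """Per-day error rate as a share of total spans, with a low-volume flag."""
    counts: dict[str, list[int]] = {}
    for span in spans:
        day_counts = counts.setdefault(span["day"], [0, 0])  # [total, errors]
        day_counts[0] += 1
        if span["status_code"] == STATUS_CODE_ERROR:
            day_counts[1] += 1
    return {
        day: {
            "spans": total,
            "error_rate": errors / total,
            "low_volume": total < min_spans,  # rate is noisy below this count
        }
        for day, (total, errors) in sorted(counts.items())
    }
```

Reporting the span count next to the rate keeps a quiet day with one failure out of four from being read the same way as a busy day with a sustained 25% failure rate.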

The underlying principle

Snowflake Intelligence generates more observability data than most teams realize. There is a wealth of trace data, including conversation history, span-level token counts, tool call inputs and outputs, status codes, model names, and question categories.

The challenge is not what data is available, but rather the ability to watch it continuously, learn what normal looks like, and surface deviations before users notice them. All of this also needs to happen at scale, as you and the rest of the world race to deploy fleets of agents, not just single-agent workflows.