Data Observability | Generative AI | Updated April 9, 2025

Monitoring Unstructured Data with Monte Carlo

AUTHOR | Elor Arieli

The idea of leveraging unstructured data in production isn’t new by any means — but in the age of AI, unstructured data has taken on a whole new role.

Ubiquitous access to models that can easily extract insight and information from text, images, and videos has opened the door for organizations to take advantage of datasets that were previously ignored. Internal documents, manuals, policies, qualitative user feedback, call transcripts, and many others are now fair game for data + AI teams, especially as both Databricks and Snowflake have made it increasingly easier to ingest and process this data.

Much like traditional data pipelines, bad data can wreak all kinds of reliability havoc in your AI applications. That’s nothing new. What is new is that, in the age of AI, that bad data is becoming increasingly unstructured.

That’s why Monte Carlo’s data + AI observability doesn’t just scale coverage for structured data—it empowers data teams to identify, triage, and resolve issues in their unstructured data in Snowflake, Databricks, and BigQuery.

Read on to learn how our customers leverage Monte Carlo to monitor and troubleshoot anomalies in unstructured data pipelines for three popular use cases.

1) Monitoring user sentiment

User sentiment is often some of the most valuable data that product teams can access, and it’s also some of the most difficult to quantify. Fortunately, Monte Carlo empowers data + AI teams to evaluate sentiment using our own metric monitors as a baseline.

In our case, we’re interested in understanding how our customers engage with the alerts we send them and whether there are meaningful changes in their sentiment. Here’s how the Monte Carlo team uses our metric monitors to measure sentiment in our users’ (freeform natural language) replies to the alerts generated by our system.

Monitoring user sentiment in Snowflake

  1. First, create a metric monitor on the table and choose a time axis and aggregation of your choice. (In our example below, we created a metric monitor and bucketed by day based on export_ts.)
  2. We then created a custom metric that is the average of the sentiment score (Snowflake Cortex outputs a sentiment score between -1 and 1). Note that “text” here is the name of the column holding the Slack messages. (A rough sketch of the underlying query follows this list.)
  3. Finally, we set an alert threshold. For example, we chose to set it to ≤ -0.5 as an average, which is quite negative, but you should set it to whatever fits your needs. You can also set it to a machine learning threshold and have Monte Carlo track the metric value for you. Remember, the range is from -1 to 1.
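
To make the custom metric concrete, here is a rough sketch of the kind of query it boils down to, assuming the Slack replies land in a hypothetical table called alert_replies with the text and export_ts columns from the example above:

```sql
-- Sketch: average Snowflake Cortex sentiment score per day.
-- alert_replies is a hypothetical table name; text and export_ts come from the example above.
SELECT
    DATE_TRUNC('day', export_ts)          AS reply_day,
    AVG(SNOWFLAKE.CORTEX.SENTIMENT(text)) AS avg_sentiment  -- score between -1 and 1
FROM alert_replies
GROUP BY 1
ORDER BY 1;
```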

Monitoring user sentiment in Databricks

  1. When using Databricks, you will start by creating a metric monitor on the table and choosing a time axis and aggregation of your choice. (Again, for our example below, we chose to create a monitor on the table bucketed by day based on export_ts. You should choose the settings that work for you.)
  2. Since Databricks’ AI sentiment analysis returns strings (positive, negative, neutral) rather than numerical scores, we segment the data based on the sentiment provided by the LLM instead of averaging over scores.
  3. Finally, we set the monitor to track relative row count, that is, the percentage of data each segment represents, on the sentiment response, so that we’re alerted when the distribution of sentiments in the responses changes. In this case, since we track all the different sentiments, you may set it to simply detect anomalies automatically. (A sketch of the segmented query follows this list.)
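
As a rough sketch, the segmented view Monte Carlo monitors could come from a query like the one below, again assuming a hypothetical alert_replies table; ai_analyze_sentiment is Databricks’ built-in sentiment AI function:

```sql
-- Sketch: daily count (and share) of replies per sentiment label.
-- alert_replies is a hypothetical table; ai_analyze_sentiment returns strings
-- such as 'positive', 'negative', 'neutral', 'mixed'.
SELECT
    date_trunc('DAY', export_ts) AS reply_day,
    ai_analyze_sentiment(text)   AS sentiment,
    COUNT(*)                     AS replies,
    COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY date_trunc('DAY', export_ts)) AS share_of_day
FROM alert_replies
GROUP BY 1, 2
ORDER BY 1, 2;
```

In practice, Monte Carlo’s relative row count handles the percentage calculation for you; the share_of_day column is only there to illustrate what the monitor tracks.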

2) Tracking user experiences in sales calls

Hopping on a call with a customer, a prospect, or even a partner can deliver a treasure trove of insights. The question is: how do you activate them? Or, more relevantly, how do you validate them?

At Monte Carlo, we use Gong to track and store summaries of all our sales calls, along with all the data that comes with them. Feedback around user experience, monitor creation experience, and integration experience all yields data points we care deeply about, particularly when those metrics deviate from the normative Monte Carlo experience.

To validate our customer and prospect experiences, we use custom SQL rules to monitor how often (%) specific issues are mentioned, and set up alerts to know when that percentage changes. Here’s how we do it.

Monitoring data in Snowflake

  1. In this example, we first create a custom SQL rule to calculate the % of calls in the last day where a specific issue is mentioned, and we set it to track “value returned.” It’s also possible to track “rows returned,” but we opted to track a percentage rather than a count, so in this example “value returned” was a better fit.
  2. We then added the different prompts to the classification model as variables in order to track multiple issues using the same rule. Each variable was set as a prompt that explains the task; in our case, the prompt included topics to look for in the call summary. (A rough sketch of what such a rule might look like follows this list.)
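
Here is a rough sketch of the shape such a rule can take, assuming the Gong call summaries land in a hypothetical gong_call_summaries table with summary and call_date columns; the model name and prompt are illustrative as well:

```sql
-- Sketch: % of yesterday's calls whose summary mentions a specific issue,
-- classified with Snowflake Cortex COMPLETE. Table, columns, model, and prompt
-- are assumptions; in Monte Carlo, the issue description would be a rule variable.
SELECT
    100.0 * COUNT_IF(
        TRIM(SNOWFLAKE.CORTEX.COMPLETE(
            'mistral-large',
            'Answer only YES or NO. Does this call summary mention problems creating monitors? Summary: ' || summary
        )) ILIKE 'YES%'
    ) / NULLIF(COUNT(*), 0) AS pct_calls_with_issue
FROM gong_call_summaries
WHERE call_date >= DATEADD('day', -1, CURRENT_DATE());
```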

Monitoring data in Databricks

  1. Similar to what we did in our Snowflake example, we start by setting up a custom SQL rule to calculate the % of calls in the last day where a specific issue is mentioned and set it to track “value returned” (not “rows returned,” for the same reason mentioned above).
  2. Just as we did on Snowflake, we then add the different prompts to the classification model as variables, where each variable is a prompt that explains the task to be performed. (The sketch after this list shows a possible Databricks equivalent.)
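
A rough Databricks counterpart could use the ai_query function against a foundation model serving endpoint; the endpoint name, table, columns, and prompt below are all assumptions:

```sql
-- Sketch: % of yesterday's calls whose summary mentions a specific issue,
-- classified with ai_query. Endpoint name, table, columns, and prompt are assumptions.
SELECT
    100.0 * COUNT(CASE WHEN TRIM(ai_query(
        'databricks-meta-llama-3-1-70b-instruct',
        CONCAT('Answer only YES or NO. Does this call summary mention problems creating monitors? Summary: ', summary)
    )) ILIKE 'YES%' THEN 1 END) / NULLIF(COUNT(*), 0) AS pct_calls_with_issue
FROM gong_call_summaries
WHERE call_date >= date_sub(current_date(), 1);
```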

3) Tracking model outputs

Your AI pipeline doesn’t just begin with unstructured data; it ends with unstructured outputs. Monitoring the unstructured outputs of an agentic system is one of the most fundamental ways we validate the reliability of agents.

At Monte Carlo, we’re hard at work developing a host of agents (like our upcoming Troubleshooting Agent) to help our customers deliver reliable data + AI at scale. But delivering agents to production means monitoring their outputs at scale. To do that, we leverage a secondary LLM agent to cover multiple key metrics. In this blog, we’ll cover two of them: prompt alignment and answer relevance.

In the case of our Troubleshooting Agent—a resource designed to identify solutions for data anomalies—we handle it this way:

  1. First, we output all of our model calls to a table. In our case, we called it res_bot_responses. In this table, we store the incident on which the agent ran, the agent responses, the model used, the parameters, the prompts, and a few other features we use to track our model’s output quality.
  2. In order to track prompt alignment, we use another LLM (preferably from a different vendor so it has different biases) to score how well the model output aligns with the objectives given in the prompt. We then track average alignment score over time using Monte Carlo.

To set this up on Snowflake, users can:

  1. Use Snowflake’s general-purpose LLM function, COMPLETE, and have the model return a score (or null) for each of the agent’s outputs.
  2. Use SQL rules to calculate the average score for the latest model outputs and get alerted when that value is anomalous.
  3. Use variables for segmentation by incident_type.
  4. Set the threshold. You can make it automatic on the average score (value based) returned, or set the rule to return all rows where the score is below a certain threshold so you can see which specific cases were anomalous. We chose the latter in order to see all the model outputs that were rated below a certain score. (A sketch of the scoring query follows below.)
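
Here is a rough sketch of what the scoring query behind such a rule could look like. The res_bot_responses table comes from the steps above; the column names (prompt, response, created_at, incident_type), the model, the grading prompt, and the 1-to-10 scale are all assumptions:

```sql
-- Sketch: have a second LLM grade how well each agent response follows its prompt,
-- then average the score (here per incident_type). Column names, model, prompt,
-- and scale are assumptions; res_bot_responses is the table described above.
SELECT
    incident_type,
    AVG(TRY_TO_DOUBLE(TRIM(SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large',
        'On a scale of 1 to 10, how well does this response follow the objectives in the prompt? '
        || 'Reply with a single number only. PROMPT: ' || prompt || ' RESPONSE: ' || response
    )))) AS avg_alignment_score
FROM res_bot_responses
WHERE created_at >= DATEADD('day', -1, CURRENT_TIMESTAMP())
GROUP BY incident_type;
```

In Monte Carlo itself, the incident_type segmentation would typically come from rule variables rather than the GROUP BY shown here.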

Here is how it looks for Prompt Alignment without segments:

And this is how it looks for Answer Relevancy with segments:

To set this up on Databricks, users can:

  1. Use Databricks’ general-purpose LLM function, ai_query, and make the model return a score for each of the agent’s outputs.
  2. Use SQL rules to calculate the average score for the latest model outputs and get alerted when that value is anomalous.
  3. Add variables for the incident_type segments.
  4. Just like with Snowflake, you can set your threshold to be automatic on the average score (value based), or set it to return all rows where the score is below a certain threshold (and see which specific cases were anomalous). A sketch of the Databricks scoring query follows this list.
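
As before, this is only a sketch: the serving endpoint name, column names, grading prompt, and 1-to-10 scale are assumptions.

```sql
-- Sketch: grade each agent response with ai_query and average the score per incident_type.
-- Endpoint name, columns, prompt, and scale are assumptions.
SELECT
    incident_type,
    AVG(TRY_CAST(TRIM(ai_query(
        'databricks-meta-llama-3-1-70b-instruct',
        CONCAT('On a scale of 1 to 10, how well does this response follow the objectives in the prompt? ',
               'Reply with a single number only. PROMPT: ', prompt, ' RESPONSE: ', response)
    )) AS DOUBLE)) AS avg_alignment_score
FROM res_bot_responses
WHERE created_at >= current_timestamp() - INTERVAL 1 DAY
GROUP BY incident_type;
```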

Note: it’s sometimes easier to monitor this data by creating a column populated with LLM evaluations as part of the transformation, and then tracking that output column directly using Monte Carlo. A sketch of that approach follows.
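
For example, in Snowflake the evaluation could be materialized during transformation roughly like this (table, column, model, and prompt are assumptions):

```sql
-- Sketch: persist an LLM evaluation score as a column during transformation,
-- then point a Monte Carlo metric monitor at answer_relevance_score directly.
CREATE OR REPLACE TABLE res_bot_responses_scored AS
SELECT
    r.*,
    TRY_TO_DOUBLE(TRIM(SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large',
        'Rate from 1 to 10 how relevant this response is to the question it answers. '
        || 'Reply with a single number only. RESPONSE: ' || response
    ))) AS answer_relevance_score
FROM res_bot_responses r;
```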

And, of course, all model choices in this post are for illustrative purposes only. You will likely tweak the models (and the prompts) for your own use case.

End to end data + AI observability across your stack

Much like observability for traditional data pipelines, data + AI observability is only truly effective when it’s deployed end-to-end. AI failures often begin with data issues, so understanding the health of both model inputs and outputs is critical.

At Monte Carlo, we’re committed to defining the future of reliable AI for enterprise data teams. This means extending coverage into every layer and integration that could impact the reliability of your AI applications, beginning with the structured and unstructured data that powers them.

Struggling with data + AI reliability? Let’s chat.

Our promise: we will show you the product.