Generative AI | Updated Mar 12, 2025

10 Learnings After a Year of Building AI Agents in Production

AUTHOR | Elor Arieli

The AI landscape is expanding all the time—and with that expansion comes all kinds of new opportunities to reshape how we think about building for the enterprise. 

Over the last year, the engineering team here at Monte Carlo has worked with, advised, and enabled hundreds of customers on their AI journeys—from chatbots and RAG pipelines to natural language analytics enablement and structuring unstructured data—and all that with Monte Carlo’s data + AI observability protecting output reliability along the way.

But we don’t just enable our customers to use AI. We take an AI-first approach to Monte Carlo as well, democratizing data+AI reliability with thoughtful automation and generative features to make data+AI management easier—like AI-powered monitor recommendations, SQL authoring with AI, and soon, Gen-AI powered troubleshooting (more news to come).

An example of how Monte Carlo uses an LLM to automatically identify relationships between data fields, like pitch type and speed.

After a year of deploying AI agents within Monte Carlo, we’ve walked away with more than a few learnings to share—and that’s exactly what I’m going to do here today. 

In this article, we’ve consolidated 10 key lessons and takeaways you can implement right now to grease the wheels for your own AI journey.

Let’s take a look!

Lesson 1: Separate AI research from product research

First and foremost, keep product requirements out of your research.

In the early phases of a new AI project, it’s easy to let your research become constrained by a product’s current capabilities, but it’s important to remember why we research. Research doesn’t just tell us what we can do right now—it shows us what we could do tomorrow if we make the right bets today.

So, keep that framework in mind as you research. Look for opportunities to go beyond the scope of your existing product to derive new value or improve it. The easiest way to do that is to know when and how to involve your product team. 

While this can certainly look different for every team, I would recommend shielding your product team from any preliminary AI research to give your team ample time to digest the project’s possibilities.

Once you’ve identified where and how you might provide new value for customers, invite the product team to help focus and align those ideas to broader company initiatives. This ensures that your project remains maximally innovative without wasting resources to deliver something that’s out of step with the long-term vision for the product.

Lesson 2: Build observability into model outputs

It’s true that an AI is only ever as useful as its outputs are reliable. And at Monte Carlo, we take reliability very seriously.

As an engineering team, we knew early on that if we were going to build AI workflows into our product, their outputs would need to satisfy our users. For us, that meant monitoring AI outputs just as much as we monitor the data, system, and code that powers it.

While there’s a lot of noise around how to monitor model outputs right now, in my opinion, a good approach should include some combination of:

  1. Comparing results to ground truth data (when available).
  2. Evaluating results with other models, also known as evaluators or LLM-as-a-judge (a minimal sketch follows this list). These evaluator models can sometimes be trained on the same ground truth data mentioned in #1.
  3. Tracking user behavior in response to results, like acceptance rate, adoption, etc.
  4. Scoring results with human evaluators. These scores can also be used to create the ground truth set in #1 and to train the evaluators noted in #2.
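To make #2 concrete, here is a minimal LLM-as-a-judge sketch in Python. It assumes an OpenAI-style client; the judge model, prompt rubric, and the score threshold for routing to human review are illustrative, not Monte Carlo’s implementation.

```python
# A minimal LLM-as-a-judge sketch. The client, judge model, rubric, and
# threshold below are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI-generated answer.
Question: {question}
Answer: {answer}
Respond with JSON: {{"score": <integer 1-5>, "reason": "<one sentence>"}}"""

def judge(question: str, answer: str) -> dict:
    """Score a model output with a second 'evaluator' model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: combine with #3/#4 by routing low-scoring outputs to human review.
verdict = judge("Which column stores pitch speed?", "The pitch_speed_mph column.")
if verdict["score"] < 3:
    print("Route to human review:", verdict["reason"])
```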

Lesson 3: Validate results with data or human interactions

While AI models are amazing at some things, they’re not-so-amazing when it comes to consistent reliability. 

One of the charms of AI being a predominantly black-box solution is that while hallucination is a known issue, it’s not always clear how or why it happens. What’s worse, these hallucinations can cascade into runaway mistakes in agentic applications where multiple AI workflows are often strung together.

To combat hallucinations and output errors in complex agents, we recommend adding intermediate steps in the process to validate results, whether by hand, against sample ground truth, or through deterministic code. In our own AI monitor recommendations, we check each monitor against a sample from the table to validate that the AI-proposed monitors actually hold up against the data. We then give the user final say to accept or reject the monitor once it’s recommended.
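As a rough illustration of the deterministic-code option, here is a sketch that checks a hypothetical AI-proposed range monitor against a sample of rows before it ever reaches the user. The monitor structure, threshold, and helper names are assumptions for the example, not our actual implementation.

```python
# Sketch: sanity-check an AI-proposed monitor against a table sample before
# surfacing it for human approval. The monitor shape here is hypothetical.
import pandas as pd

def validate_monitor(proposed: dict, sample: pd.DataFrame) -> bool:
    """Return True only if the proposed rule is plausible on real data."""
    col = proposed["column"]
    if col not in sample.columns:
        return False  # the model hallucinated a field that doesn't exist
    # For a value-range monitor, the rule shouldn't alert on most existing rows.
    in_range = sample[col].between(proposed["min"], proposed["max"]).mean()
    return in_range >= 0.95

sample = pd.DataFrame({"pitch_speed_mph": [88, 92, 95, 101, 84]})
proposed = {"column": "pitch_speed_mph", "min": 40, "max": 110}

if validate_monitor(proposed, sample):
    print("Monitor passes the data check; hand it to the user for final approval.")
```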

From my perspective, keeping a human-in-the-loop for any critical functional or performance-level decisions is always a good idea. Automation is great, and you should definitely use it as often as you can, but at the end of the day, nothing beats human experience. Even just one validation every few agent steps can reduce the likelihood of model runaway by a demonstrable margin.

Lesson 4: Divide and aggregate small agentic tasks 

When it comes to creating complex agentic workflows, one best practice we’ve found to work well is splitting the agent’s work into multiple simple tasks. These simple tasks can then be performed by a small-fast-stupid model, while a single large-slow-smart model can be leveraged in parallel to validate, aggregate, and summarize the information before passing on the results.

This workflow creates a good balance between speed, cost, and quality of results.
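Here is a minimal sketch of that divide-and-aggregate shape, assuming an async call_llm helper that wraps whatever provider you use; the model names and prompts are placeholders.

```python
# Divide-and-aggregate sketch: many small-model calls in parallel, then one
# large-model call to validate and merge. `call_llm` and the model names are
# placeholders for your own provider wrapper.
import asyncio

async def call_llm(model: str, prompt: str) -> str:
    # Placeholder: swap in a real provider call (Bedrock, OpenAI, etc.).
    await asyncio.sleep(0)
    return f"[{model} output for a {len(prompt)}-char prompt]"

async def summarize_queries(queries: list[str]) -> str:
    # Fan out: a cheap, fast model handles each simple sub-task.
    subtasks = [call_llm("small-fast-model", f"Extract the fields used in:\n{q}")
                for q in queries]
    partials = await asyncio.gather(*subtasks)

    # Aggregate: one capable model validates, deduplicates, and summarizes.
    merged = "\n".join(partials)
    return await call_llm("large-slow-smart-model",
                          f"Validate and summarize these field lists:\n{merged}")

print(asyncio.run(summarize_queries(["SELECT a FROM t", "SELECT b, c FROM u"])))
```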

Lesson 5: Split tasks horizontally to reduce runtime 

Let’s say we have two tasks we want to perform on a large dataset. Maybe we want to extract fields from queries to understand how and where each of them is used (which joins, filters, group-bys, etc.). Or, maybe we want to summarize how each field in a table is used and create rules to manage them.

It can be difficult for a model to do both tasks at once on a large input, which makes splitting tasks a good idea here. But while any split would be beneficial, not all approaches to splitting tasks are created equal.

Vertical task splitting

Vertical task splitting refers to splitting the operation into two consecutive tasks, where the output of the first task becomes the input of the second—with a large model powering both of these tasks. 

As an example, you might have one LLM call that receives 20 queries and extracts all the fields from those queries. Then, you might have a second LLM call that extracts the operations in which each of those fields is used. This is what’s known as “vertical splitting,” and a minimal sketch of this shape follows below.
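Here is what that chain might look like in code. The call_llm helper is a placeholder, and both steps run on a single large model over all of the queries at once.

```python
# Vertical split sketch: two consecutive large-model calls, where the output
# of the first becomes part of the input to the second. `call_llm` is a
# placeholder for your provider wrapper.
def call_llm(model: str, prompt: str) -> str:
    # Placeholder; swap in your provider's client here.
    return f"[{model} response]"

def vertical_pipeline(queries: list[str]) -> str:
    all_queries = "\n".join(queries)
    # Step 1: extract every field across all 20 queries in one large call.
    fields = call_llm("large-model",
                      f"List all fields used in these queries:\n{all_queries}")
    # Step 2: feed step 1's output back in to classify how each field is used.
    return call_llm("large-model",
                    "For each field below, list the operations (joins, filters, "
                    f"group-bys) it appears in.\nFields:\n{fields}\nQueries:\n{all_queries}")
```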

Horizontal task splitting

Alternatively, horizontal task splitting refers to doing both of the tasks at the same time but on a much smaller input. 

Using the above example, we might ask the model to extract both the fields and the operations in which they are used, but only give it a single query per call. With this method, we would have 20 concurrent LLM calls, but we can use a smaller model for each call since the task is simpler. Then, we could aggregate the results from all sub-agents to arrive at the same final output. A sketch of this shape follows below.
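And here is the same work split horizontally, assuming the same placeholder call_llm helper; each query gets one small-model call that does both tasks, and the partial results are aggregated afterward.

```python
# Horizontal split sketch: 20 concurrent small-model calls, each doing both
# tasks on a single query, followed by a simple aggregation step.
from concurrent.futures import ThreadPoolExecutor

def call_llm(model: str, prompt: str) -> str:
    # Placeholder; swap in your provider's client here.
    return f"[{model} response]"

def analyze_one(query: str) -> str:
    return call_llm("small-model",
                    "Extract the fields in this query and the operations "
                    f"(joins, filters, group-bys) each is used in:\n{query}")

def horizontal_pipeline(queries: list[str]) -> str:
    with ThreadPoolExecutor(max_workers=20) as pool:
        partials = list(pool.map(analyze_one, queries))  # concurrent small calls
    # Aggregate with deterministic code (or one final LLM call if needed).
    return "\n".join(partials)
```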

Leveraging horizontal task splitting minimizes the number of input tokens, output tokens, and model size required to complete an operation, reducing the total runtime while maintaining (or even reducing) costs compared to vertical splitting.

Twenty smaller models running in parallel and outputting a smaller number of tokens is almost always faster than a single large model running on all the data all at once.

Lesson 6: Some tasks will have to wait for better models

When it comes to how we develop for AI, we generally approach our projects with a “largest to smallest” mentality. By that I mean that we typically develop our MVP solution using a larger model to understand the full capabilities, and then move to a simpler model at a later stage in the project to reduce cost and increase performance.

In a few cases, we’ve tried a small model in early development, but unfortunately, this approach hasn’t yielded quite the same quality of results. And by the time we finished, there was already a newer small model that could handle it!

And I guess that’s really my point here. It’s important to remember that newer and better models are coming out all the time. So, if you can’t get something done the way you want it on day one, that’s okay. Just be patient, and keep developing. You’ll likely get access to a better version with better results by the time you’re ready for production.

In my opinion, that’s one of the biggest benefits of working with major cloud foundational model providers; things that seem like an issue today likely won’t be tomorrow.

Lesson 7: AI solutions can be valuable for fast development

Throughout the AI development process, there will be things that can be solved by either deterministic code or by an LLM. 

Ordinarily, we’re of the mind that whatever can be solved in code should be. LLMs are wonderful—but as we’ve identified previously, that doesn’t mean they’re always reliable. 

However, there are some cases when using an LLM for a given task does make sense. For example, when we needed to parse SQL queries, we used an LLM to extract the information instead of building ASTs (abstract syntax trees), because developing the AST-based parsing would have taken us longer.

In instances like this, the LLM enabled us to develop faster, but it didn’t supplant the use of code in the final product. My perspective is that LLMs are great for rapid experimentation, but you should always be ready to undergird or replace that experimentation with code in the future.
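As a toy illustration of that trade-off, the sketch below puts a hypothetical day-one LLM prototype and a later deterministic replacement behind the same interface; the function names are made up, and sqlglot is just one example of an AST-based SQL parsing library (not necessarily what we used).

```python
# Sketch: the LLM prototype is fast to ship; the deterministic parser can
# replace it later without changing callers. Names here are hypothetical,
# and sqlglot is one example of an AST-based parsing library.
import sqlglot
from sqlglot import exp

def extract_fields_llm(sql: str) -> list[str]:
    # Day-one prototype: ask a model to list the referenced columns.
    # (Provider call omitted; imagine it returns a parsed list of names.)
    raise NotImplementedError

def extract_fields_ast(sql: str) -> list[str]:
    # Later replacement: deterministic, no hallucinations, no per-call cost.
    return sorted({col.name for col in sqlglot.parse_one(sql).find_all(exp.Column)})

print(extract_fields_ast("SELECT user_id, SUM(amount) FROM payments GROUP BY user_id"))
# -> ['amount', 'user_id']
```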

Lesson 8: Establish a cooperative workflow between product engineers and data scientists

When working on AI projects that involve both data scientists and product engineers, we had a clear division of labor: the data scientists developed the agents, selected the models, and evaluated performance, while the product engineers put them in production, built APIs, and developed user experiences. This splits the work efficiently, but it can also create some unfortunate discrepancies.

For example, when there’s an issue, only a backend engineer might have access to the AI stack, system, and logs to understand it—whereas only the data scientists have the expertise to actually solve it. 

For us, that meant that engineers and data scientists needed to get really good at cooperating. By that I mean that the product engineers needed to collect and transfer as much information as possible back to the data scientists to empower them to resolve issues as they arose.

This learning was especially important for us when it came to clearly logging and marking each call to the LLM in order to track which agent tasks were causing issues or bottlenecks in the workflow.
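In practice, that can be as simple as a thin wrapper that tags and logs every LLM call with the agent task that issued it. The sketch below is illustrative; the field names and the call_llm placeholder are assumptions, not our production logging schema.

```python
# Sketch: tag every LLM call with the agent task that issued it, so logs can
# point to the exact step that is slow or failing. Field names are illustrative.
import logging
import time
import uuid

logger = logging.getLogger("agent.llm")

def call_llm(model: str, prompt: str) -> str:
    return f"[{model} response]"  # placeholder for the real provider call

def tracked_llm_call(agent_task: str, model: str, prompt: str) -> str:
    call_id = uuid.uuid4().hex[:8]
    start = time.time()
    try:
        return call_llm(model, prompt)
    finally:
        # Product engineers ship these logs; data scientists use them to find
        # the bottleneck or failure in the agent workflow.
        logger.info(
            "llm_call id=%s task=%s model=%s prompt_chars=%d latency_ms=%.0f",
            call_id, agent_task, model, len(prompt), (time.time() - start) * 1000,
        )
```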

Lesson 9: Use LLM clone models to keep data secure

Most customers don’t want their data exposed to foundational model providers like OpenAI and Anthropic—and understandably so. And if they don’t want their data exposed to them, they definitely don’t want their data training them.

LLM clone models that stay within your own environment can be a great solution here. We chose Anthropic Claude model clones served on AWS Bedrock within our own AWS environment, but other solutions like OpenAI inside Azure or Gemini inside Google Cloud can also be great choices. 

It’s also always possible to serve your own model using these services. This solution can look different for each team!
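For reference, here is roughly what calling a Claude model through Bedrock inside your own AWS account looks like with boto3’s Converse API; the model ID, region, and prompt are examples only.

```python
# Sketch: invoking a Claude model hosted in your own AWS environment via
# Amazon Bedrock. Model ID, region, and prompt are example values.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{"role": "user",
               "content": [{"text": "Summarize how the payments table is used."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```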

Lesson 10: Leverage structured output response formats

The release of structured output formats has been a huge change (and help!) for developers using AI in production. 

Leveraging structured output formats allows teams to define the exact format a model response should arrive in. This governed consistency allows data teams to easily mix AI with deterministic code to deliver a response structure that’s deterministic even if the content itself is not. And when it comes to developing AI, my biggest takeaway is this—the more control you can have, the better.
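As one example of what this looks like in practice, here’s a sketch using the OpenAI Python SDK’s structured-output helper with a Pydantic schema (other providers offer equivalents); the schema and model name are illustrative, not the format we use for monitor recommendations.

```python
# Sketch of structured output: the model must return exactly this shape, so
# the deterministic code around it never has to guess. Schema and model name
# are illustrative.
from openai import OpenAI
from pydantic import BaseModel

class MonitorRecommendation(BaseModel):
    column: str
    monitor_type: str   # e.g. "freshness", "null_rate", "value_range"
    threshold: float
    rationale: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # any model that supports structured outputs
    messages=[{"role": "user",
               "content": "Recommend one data quality monitor for a payments table."}],
    response_format=MonitorRecommendation,
)

rec = completion.choices[0].message.parsed  # a validated MonitorRecommendation
print(rec.column, rec.monitor_type, rec.threshold)
```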

The future is data + AI observability 

Data and AI are no longer two separate systems; they’re one and the same. And that means the first-party data that augments your AI is as essential to its success as the inputs and model engine that power it.

As we continue to develop new AI systems for Monte Carlo—and as we work with our customers to observe the data, system, code and model of their own agentic solutions—we know the learnings will flow right along with it.

Technologies will come and go. Models will rise and fall. But one thing that’s become absolutely clear is this—those AI pipelines will need to be observed by an end-to-end data+AI observability solution. You can’t deliver customer value without trust—and you can’t deliver trust without reliable data and AI. If you only take one lesson away, I’d make sure it’s that one. 

Want to learn more about how Monte Carlo can support your next data+AI project? Give us a call!

Our promise: we will show you the product.