AI Observability, Data Observability Updated Apr 22 2026

The Memory Problem Changes When Agents Stop Waiting to Be Prompted

AUTHORS | Michael Segner | Virna Sekuj | Elor Arieli

Amnesia by design

The vast majority of agentic systems in production today are reactive, meaning they act based on predetermined criteria set by humans. That might be a manual trigger, a specific event, or a scheduled job.

Monte Carlo is currently building a proactive multi-agent system. Agents will have the capability to observe a data environment and take independent action, such as setting a quality monitor, routing alerts, and even remediating issues. A human can be in the loop (and should be), but the design is to minimize how often they need to be.

There are many challenging design decisions that we’ve encountered on this project, but perhaps the most important is how agents accumulate, share, and act on institutional knowledge over time.

A reactive agent’s memory answers one question: what should I recall? The user’s query, or a narrowly defined set of system prompts, defines relevance. A proactive agent’s memory has to answer a different question: what should I notice?

What we found: proactive memory works best when it’s grounded in a live structural model of the environment. Not metadata stamped on memories at write time. Not attributes guessed from embeddings. An actual graph, maintained as the source of truth, that memories point into rather than duplicate.

In this piece, we will walk through the memory design tradeoffs we’ve encountered building Monte Carlo’s proactive data reliability system, how the existing approaches to agent memory fall short for the proactive case, and what we’ve built instead.

The latest thinking in agentic memory architecture

Early agents leaned on RAG and similarity retrieval inside a vector database, with occasional arguments for fine-tuning. Architectures have evolved since.

Andrej Karpathy’s LLM Wiki pattern argues the right answer isn’t retrieving from raw documents at query time but incrementally building a persistent, synthesized wiki that compounds knowledge over time. 

Databricks’ Lakebase-backed memory infrastructure shows that two-layer storage, combining structured facts with vector search, outperforms pure retrieval by meaningful margins. LangChain has published a taxonomy for thinking about memory across semantic, episodic, and procedural types.

Each is pushing past memory-as-retrievable-facts toward something that preserves context and relationship. 

Memory tradeoffs and failure modes

All of these approaches are excellent, and all have tradeoffs. Here is how we are navigating them.

Retrieval precision

Vector search returns semantically similar memories, not necessarily relevant ones, and model performance degrades past a threshold of injected memories. An explicit prioritization layer is required. 

In our memory query path, authorization metadata filters narrow the candidate set to what’s in scope before similarity scoring runs. Filtering on authorization first meaningfully reduces the false positive rate in multi-team environments, where the same table name or incident pattern can mean very different things across domains.
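The ordering described above (filter on authorization first, then score similarity) can be sketched as follows. This is a minimal illustration, not the production query path: the `Memory` class, the `retrieve` function, the two-dimensional toy embeddings, and the domain names are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    domain: str              # authorization scope (hypothetical field name)
    embedding: list[float]   # toy vector; a real system would use an embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memories, query_embedding, allowed_domains, top_k=3):
    # Step 1: narrow the candidate set to what the caller is authorized to see.
    in_scope = [m for m in memories if m.domain in allowed_domains]
    # Step 2: rank only the in-scope candidates by similarity to the query.
    ranked = sorted(
        in_scope,
        key=lambda m: cosine(m.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


memories = [
    Memory("orders table loads late on Mondays", "finance", [1.0, 0.0]),
    Memory("orders table is a test fixture", "engineering", [0.99, 0.1]),
    Memory("revenue dashboard refreshes hourly", "finance", [0.2, 0.9]),
]
results = retrieve(memories, [1.0, 0.0], allowed_domains={"finance"}, top_k=2)
for m in results:
    print(m.domain, "|", m.text)
```

The engineering memory about the orders table is nearly identical to the query in embedding space, but it never reaches the scoring step because it is out of scope for a finance caller.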

Stale, conflicting, and bad memories

Most systems have no principled mechanism for resolving conflicting or stale memories. A bad memory produces a bad action, which may then be observed and stored as a new memory, propagating the original error.

Surfacing memories and making them editable within Monte Carlo. Provided by the authors.

The minimum viable answer is surfacing memories so humans can examine them within their natural workflows and edit and delete them as needed. Our system also updates memories when something “similar but different” occurs, but we are still working through all of the challenges related to stale and conflicting memories. 

Missing provenance

In most implementations, memory influences behavior invisibly. There is no audit trail for which stored fact shaped which response. That’s a trust problem. We tag every memory with the agent that wrote it, the source it came from, and the domain it belongs to, so the reasoning chain is inspectable.
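One way to picture this kind of tagging is a record that carries its provenance with it, plus a decision object that lists which records it used. The sketch below is an assumption-laden illustration: `MemoryRecord`, `Decision`, the field names, and the agent and source labels are hypothetical, not Monte Carlo’s schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class MemoryRecord:
    fact: str
    written_by: str   # agent that wrote the memory (hypothetical label)
    source: str       # where the fact came from (hypothetical label)
    domain: str       # domain the memory belongs to
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Decision:
    action: str
    used: list[MemoryRecord] = field(default_factory=list)

    def audit_trail(self) -> list[str]:
        """List each memory that shaped this decision, with its provenance."""
        return [
            f"{m.fact} (agent={m.written_by}, source={m.source}, domain={m.domain})"
            for m in self.used
        ]


record = MemoryRecord(
    fact="DynamoDB key format is incompatible with OnlineInference",
    written_by="troubleshooting_agent",
    source="incident_review",
    domain="engineering",
)
decision = Decision(action="flag silent alert failure", used=[record])
for line in decision.audit_trail():
    print(line)
```

Because every decision keeps references to the records it used, a reviewer can trace a response back to the stored fact, the agent that wrote it, and the domain it came from.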

Without memory, the agent gives a reasonable but generic answer. With memory, it pulls from what it already knows about your environment. In this case, a known incompatibility between DynamoDB key formats and the OnlineInference service that causes alerts to silently fail. Image provided by the authors.

Precision, bad memories, and missing provenance are solvable problems in reactive systems with careful design. In proactive systems, they get harder and new problems appear.

The proactive memory problem

So let’s (finally) talk about the shift from “what should I recall?” to “what should I notice?” It’s a bigger shift than you might imagine.

Determining relevance

How can the agent determine what is relevant? In a reactive system, this is defined by the user along with guidance from the system prompt.

In a proactive system, memory has to encode enough structure about the environment that the agent can reason about it independently. In other words, the agent must maintain a model of what normal looks like so it can recognize when something deviates. 

This is somewhat addressed, for example, in Karpathy’s wiki pattern which compounds knowledge across runs. However, establishing what normal looks like (and deviations from it) requires repeated observation of the same signals over time. 
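To make “a model of normal built from repeated observation” concrete, here is a minimal sketch that uses a simple z-score over past readings of one signal. The z-score is a stand-in technique chosen for illustration, not necessarily what the system described here uses, and the function name, thresholds, and sample values are assumptions.

```python
from statistics import mean, stdev


def is_deviation(history, observed, z_threshold=3.0, min_samples=5):
    """Return True when `observed` falls far outside the remembered baseline.

    `history` holds past observations of the same signal, e.g. daily row
    counts for one table. Thresholds here are illustrative defaults.
    """
    if len(history) < min_samples:
        # Too few repeated observations to say what normal looks like.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


row_counts = [100, 102, 98, 101, 99, 100]
print(is_deviation(row_counts, 101))       # within the usual range
print(is_deviation(row_counts, 140))       # far outside it
print(is_deviation(row_counts[:3], 140))   # not enough history to judge
```

The `min_samples` guard reflects the point above: until the same signal has been observed enough times, the agent has no baseline to deviate from.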

Determining causality

Reactive memory is mostly facts and preferences. “This table has a known Monday delay.” A proactive agent needs to recognize temporal and causal structures. For example, this pattern preceded an incident three times in six weeks. You can’t derive that kind of insight from a pile of discrete facts, no matter how good your retrieval is. 
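A claim like “this pattern preceded an incident three times in six weeks” needs timestamps and an ordering relation, not just stored facts. The sketch below counts how often a pattern was followed by an incident within a lead window. The function name, the 12-hour lead window, and the dates are illustrative assumptions, and co-occurrence in time is only a proxy for causality.

```python
from datetime import datetime, timedelta


def precursor_count(pattern_times, incident_times, lead_window, lookback, now):
    """Count pattern sightings in the lookback period that were followed
    by an incident within `lead_window`."""
    cutoff = now - lookback
    count = 0
    for seen in pattern_times:
        if seen < cutoff:
            continue  # outside the period we are reasoning about
        if any(seen < inc <= seen + lead_window for inc in incident_times):
            count += 1
    return count


now = datetime(2026, 4, 22)
pattern_times = [
    datetime(2026, 3, 16, 2),
    datetime(2026, 3, 30, 2),
    datetime(2026, 4, 13, 2),
    datetime(2026, 4, 20, 2),   # seen, but no incident followed
]
incident_times = [
    datetime(2026, 3, 16, 8),
    datetime(2026, 3, 30, 5),
    datetime(2026, 4, 13, 12),
]
hits = precursor_count(
    pattern_times,
    incident_times,
    lead_window=timedelta(hours=12),
    lookback=timedelta(weeks=6),
    now=now,
)
print(f"pattern preceded an incident {hits} times in six weeks")
```

Keeping the sightings that were not followed by an incident matters as much as counting the hits, since the ratio between the two is what tells the agent whether the pattern is worth acting on.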

This is particularly important because it gives the agent context for the first question humans instinctively ask when deciding between tasks: “is the juice worth the squeeze?” When we first started Monte Carlo, we quickly learned that teams didn’t want to be alerted to everything. They wanted to be alerted to the important things (and told where the problem was).

Determining corrections

In a reactive system, if the agent gets something wrong, the user provides a correction and the system improves, sometimes mid-task. But in a proactive system, that direct human feedback is sparser by design.

In this case, the production data environment also becomes the reinforcement learning environment. An incident that wasn’t caught becomes the correction mechanism. The signal arrives later, is harder to attribute, and delays or missed opportunities for correction could compound recurring issues.

How we’re approaching the proactive memory problem

Similar to the aforementioned approaches, we use a two-layer architecture that reflects two different retrieval needs.

We use a semantic vector store (mem0 + pgvector) for “find me something similar to this.” This is used for fuzzy retrieval of past incidents, patterns, and context. We use a structured key-value store (LangGraph BaseStore) for “what do we know about this specific thing.” This is a direct lookup on a specific asset, incident, or tool behavior. 

Both stores are populated by the same extraction pipeline. Tool outputs are run through a lightweight extraction model that decides what’s worth remembering and writes the distilled fact to the appropriate store.
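The shape of that pipeline, one extraction step feeding two stores, can be sketched with in-memory stand-ins. Nothing below uses the mem0, pgvector, or LangGraph APIs: the list and dict are placeholders for the two stores, the rule-based `extract` function stands in for the extraction model, and routing on the presence of an `asset_id` is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    text: str
    asset_id: Optional[str] = None  # set when the fact concerns one specific asset


def extract(tool_output: dict) -> Optional[Fact]:
    """Stand-in for the extraction model: keep only outputs flagged notable."""
    if not tool_output.get("notable"):
        return None
    return Fact(text=tool_output["summary"], asset_id=tool_output.get("asset_id"))


class MemoryStores:
    def __init__(self) -> None:
        self.semantic: list[str] = []          # placeholder for the vector store
        self.keyed: dict[str, list[str]] = {}  # placeholder for the key-value store

    def write(self, fact: Fact) -> None:
        if fact.asset_id is not None:
            # "What do we know about this specific thing": direct lookup by asset.
            self.keyed.setdefault(fact.asset_id, []).append(fact.text)
        else:
            # "Find me something similar to this": fuzzy retrieval later.
            self.semantic.append(fact.text)


tool_outputs = [
    {"summary": "orders table has a known Monday delay", "asset_id": "orders", "notable": True},
    {"summary": "schema changes often precede freshness incidents", "notable": True},
    {"summary": "query returned 0 rows", "notable": False},
]
stores = MemoryStores()
for output in tool_outputs:
    fact = extract(output)
    if fact is not None:
        stores.write(fact)

print(stores.keyed)
print(stores.semantic)
```

The third tool output is dropped at the extraction step, which is the point of having that step: most tool output is not worth remembering.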

However, the most important architectural decision we’ve made is one that isn’t really a memory storage decision at all: we built on top of data lineage, which is essentially a form of entity graph.

Putting agents on rails

An agent without rails will happily explore forever. We learned this the hard way when a new ReAct sub-agent inside our Troubleshooting Agent shipped with a 50-tool-call ceiling and no strong priors about where to start. 

Our AWS Bedrock spend increased 7.6x in three weeks, driven almost entirely by this one sub-agent taking 20 turns to arrive somewhere a directed investigation could have reached in five. 

Agents need a neighborhood to explore in and priors about where to look first. Free exploration is expensive, slow, and less accurate. 

In our case, we already maintained a comprehensive model of the data estate, covering tables, lineage, domain, criticality, downstream consumers, query patterns, and incident history.

When we built the memory system, we built it on top of that lineage. Memories attach to nodes that are already connected and continuously updated from our metadata extraction pipeline. An asset-keyed memory stored on the orders table isn’t a standalone fact because the table is rarely considered in isolation. Each agent considers the entire connected system, and associated memories, when setting monitors; grouping and routing alerts; providing the root cause of an issue; or taking any other action.
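A toy version of “memories attach to nodes in a graph” might look like the sketch below: memories are keyed by node, and an agent gathers the memories for an asset together with those of its lineage neighbors within a bounded number of hops. The table names, memory texts, hop limit, and function names are all hypothetical.

```python
from collections import deque

# Downstream edges of a tiny, made-up lineage graph.
lineage = {
    "raw_orders": ["orders"],
    "orders": ["revenue_daily"],
    "revenue_daily": ["finance_dashboard"],
    "finance_dashboard": [],
}

# Memories point at graph nodes instead of copying graph attributes.
memories_by_node = {
    "raw_orders": ["source export sometimes arrives after 06:00"],
    "orders": ["known Monday delay"],
    "revenue_daily": ["null spike followed a schema change"],
    "finance_dashboard": ["read by the weekly exec report"],
}


def neighborhood(graph, start, max_hops):
    """Breadth-first search upstream and downstream, bounded by max_hops.

    Returns a dict mapping each reachable node to its hop distance.
    """
    adjacency = {}
    for src, dsts in graph.items():
        adjacency.setdefault(src, set())
        for dst in dsts:
            adjacency[src].add(dst)
            adjacency.setdefault(dst, set()).add(src)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # stay inside the neighborhood
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


def connected_memories(graph, memories, start, max_hops=1):
    hops = neighborhood(graph, start, max_hops)
    return {node: memories[node] for node in hops if node in memories}


for node, facts in connected_memories(lineage, memories_by_node, "orders").items():
    print(node, "->", facts)
```

The hop limit is one simple way to give an agent a neighborhood to explore in, and because the graph is read at query time, a lineage change alters which memories are in reach without rewriting any of them.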

It’s also scoped by domain. A user with access to the Finance domain shouldn’t see memories derived from Engineering domain alerts. Great for governance and also great for correctness. 

This is the answer to “what to notice.” The proactive orchestrator doesn’t have to determine relevance from scratch. It traverses a graph it already knows, looking for nodes where something has changed relative to what memory says is normal. 

General-purpose memory frameworks treat memories as self-contained facts and have to guess which attributes to stamp on at write time. Treating memories as pointers into a live graph keeps them accurate as the data estate changes underneath them.

The difference matters more as the system gets more autonomous.

Providing a framework for priors

Even with a well-defined path and a multi-layered memory system, it’s easy for agents to get overwhelmed and struggle to separate the signal from the noise. While a truly proactive system would need no instruction beyond the word “go,” we’re not quite there yet.

So we operationalize “noticing.” Rather than having memory implicitly determine what gets acted on, we’ve made it explicit. For example, the triage agent scores each incident using a composite of table criticality, use case importance, downstream consumer count, and historical patterns (memory!). 

The triage agent can recognize that a pattern has appeared before, that it has or hasn’t previously led to an incident, and that the current situation is or isn’t in a known range. 
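A composite score of this kind can be sketched as a weighted sum over the four inputs named above. The weights, the normalization of consumer count, the 0-to-1 scales, and every name below are assumptions for illustration. The actual scoring formula is not described in this piece.

```python
from dataclasses import dataclass


@dataclass
class IncidentContext:
    table_criticality: float     # 0..1 (assumed scale)
    use_case_importance: float   # 0..1 (assumed scale)
    downstream_consumers: int
    prior_incident_rate: float   # 0..1, share of past sightings that led to an incident


# Hypothetical weights; they sum to 1 so the score stays in 0..1.
WEIGHTS = {
    "criticality": 0.3,
    "use_case": 0.2,
    "consumers": 0.2,
    "history": 0.3,
}


def triage_score(ctx: IncidentContext, max_consumers: int = 50) -> float:
    # Cap and normalize the consumer count so one huge fan-out cannot dominate.
    consumer_signal = min(ctx.downstream_consumers, max_consumers) / max_consumers
    return (
        WEIGHTS["criticality"] * ctx.table_criticality
        + WEIGHTS["use_case"] * ctx.use_case_importance
        + WEIGHTS["consumers"] * consumer_signal
        + WEIGHTS["history"] * ctx.prior_incident_rate
    )


critical = IncidentContext(1.0, 0.5, 25, 1.0)
benign = IncidentContext(0.2, 0.1, 2, 0.0)
print(round(triage_score(critical), 2))
print(round(triage_score(benign), 2))
```

The `prior_incident_rate` term is where memory enters the score: the same anomaly on the same table ranks differently depending on whether that pattern has led to incidents before.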

Bringing all of this context together is critical because even if something deviates from the norm that doesn’t mean it’s significant. At Monte Carlo we have a saying, “not all anomalies are interesting.”

We also categorize memories so agents can put them in context (and meet regulatory requirements). Six distinct memory sources feed both layers. Each has its own provenance, scope, and governance requirements.

Providing this framework requires domain-specific knowledge. Our team needs to understand entity relationships, common patterns, causal chains, key signals, and next steps. Understanding how the business works and how organizations operate is as important as it’s ever been.

What’s still hard

We want to be honest about what we haven’t solved, because anyone building toward proactive agents will hit the same walls. We still have open questions about how to determine which new patterns are worth committing to memory, how to handle ongoing maintenance, and how the correction loop will work at scale.

But early results are promising. The design shift required to go proactive isn’t just adding more memory or better retrieval. It’s building memory that encodes the structure of the environment well enough that an agent can reason about it and, most importantly, recognize meaningful deviations from normal.

The memory system didn’t have to define how tables relate to each other, which assets are critical, or which teams own what. All of that was already there. That’s the combination that makes a proactive agent possible: not just memory, but memory that knows where it is.
