4 Famous AI Fails (& How to Avoid Them)
AI hallucinations may grab the headlines, but they aren’t the scariest thing happening to AI in production. In fact, much of what masquerades as “hallucination” in today’s AI systems is really a data or operational issue in disguise.
Some of the biggest AI reliability issues include:
- Poor source data quality – Inaccurate, incomplete, or outdated data feeding your models
- Embedding drift – Vector representations that no longer match current reality (a drift check is sketched after this list)
- Confused context – Missing or irrelevant information in RAG pipelines
- Output sensitivity – Small input changes causing wildly different results
- Too many humans in the loop – Bureaucratic overhead that slows response to issues
- Not enough humans in the loop – Insufficient oversight allowing harmful patterns to persist
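To make one of these failure modes concrete, here’s a minimal sketch of what an embedding drift check could look like: keep a reference sample of embeddings from when your index was built, periodically re-embed the same documents with the current model, and alert when similarity falls. The function names and threshold are illustrative assumptions, not any particular platform’s API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_drift_detected(
    baseline_embeddings: list[np.ndarray],  # vectors captured when the index was built
    current_embeddings: list[np.ndarray],   # the same documents, re-embedded today
    drift_threshold: float = 0.90,          # illustrative; tune for your embedding model
) -> bool:
    """Flag drift when the average similarity between old and new vectors drops."""
    sims = [
        cosine_similarity(old, new)
        for old, new in zip(baseline_embeddings, current_embeddings)
    ]
    avg_sim = sum(sims) / len(sims)
    if avg_sim < drift_threshold:
        print(f"Embedding drift detected: mean similarity {avg_sim:.3f} < {drift_threshold}")
        return True
    return False
```

A check like this only matters if it runs on a schedule and someone owns the alert, which is where the rest of this article comes in.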
Recent research demonstrated how vulnerable non-deterministic AI systems are to hijacking, even with little or no user interaction. But the reality is, you don’t need sophisticated attacks to expose AI vulnerabilities. If you’re deploying AI to production without an operational model to manage end-to-end reliability at scale (data, system, code, and model), the smallest oversight might just become the next big headline.
In this article, we’ll look at four of the biggest AI fails of recent years, and what, if anything, could have prevented them. I know it’s scary, but don’t cover your eyes. Let’s dive in.
1. McDonald’s McHire Bot: When AI Manages AI (Badly)
In the summer of 2025, security researcher Ian Carroll ran into a Reddit thread where McDonald’s applicants complained about poor experiences with the company’s hiring chatbot. He decided to check it out — and quickly discovered absurdly basic security flaws in the company’s “McHire” platform.
According to reporting by Wired, McDonald’s had partnered with an AI software firm to build Olivia, a chatbot designed to automate early-stage hiring — collecting resumes, screening candidates, answering questions, and triggering personality tests.
An AI built the chatbot.
Then an AI managed the chatbot.
And like many chatbots rushed to production, the AI proved to be unreliable. No surprise. It failed to answer basic questions and created frustrating experiences for job applicants. But the real problems were hiding in the infrastructure.
Carroll and fellow researcher Sam Curry gained full administrative access by guessing that a Paradox.ai staff login used “123456” as both the username and password. There was no multifactor authentication.
Once inside, they found they could access the chat logs and personal information of virtually every applicant across a database of more than 64 million records. The researchers could increment applicant ID numbers and view contact information, employment history, and conversation transcripts of people actively seeking employment — the perfect material for targeted phishing attacks and payroll scams.

What went wrong?
Where do you start?
- The governance standards?
- The metadata or context engineering that could have prevented PII exposure?
- The need to monitor and remediate the model’s bad outputs (or inputs)?
- Or the need for a human in the loop that could own these pipelines?
We see it all the time: teams building AI inadvertently create unmanageable silos in the process. When you deploy AI to manage AI, reliability issues compound. Without visibility into how data flows through these systems, and without governance to manage it, a single point of failure can cascade into unreliable outputs or a major security breach.
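None of this requires exotic tooling. As a purely illustrative sketch (the record store and function below are hypothetical, not Paradox.ai’s actual system), the ID-increment attack only works when an endpoint hands back any applicant record the caller asks for; an object-level ownership check closes that door:

```python
from dataclasses import dataclass

class AuthorizationError(Exception):
    pass

@dataclass
class ApplicantRecord:
    applicant_id: int
    owner_user_id: int  # the account that submitted this application
    transcript: str

# Hypothetical in-memory store standing in for the real applicant database.
RECORDS: dict[int, ApplicantRecord] = {}

def get_applicant_record(requesting_user_id: int, applicant_id: int) -> ApplicantRecord:
    """Return a record only if the requester owns it.

    Without this check, anyone who can guess or increment applicant_id values
    can walk the entire table -- exactly what the researchers demonstrated.
    """
    record = RECORDS.get(applicant_id)
    if record is None:
        raise KeyError(f"No applicant {applicant_id}")
    if record.owner_user_id != requesting_user_id:
        raise AuthorizationError("Caller is not authorized to view this record")
    return record
```

Pair that with enforced credential policies and MFA, and the “123456” login stops being a skeleton key.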
2. UnitedHealthcare: Systematic AI Failure at Scale
In November 2023, a lawsuit filed in federal court alleged that UnitedHealthcare was using a faulty AI model to systematically deny healthcare coverage to elderly Medicare Advantage patients, overriding physician recommendations for post-acute care.
When patients appealed those denials, nine out of ten were reversed in their favor — a 90% error rate.
But many never appealed, potentially going without the care their doctors deemed medically necessary.
Here’s what happened: United Healthcare had deployed an AI model called nH Predict to determine how long patients should receive care in nursing facilities following hospital discharge. The algorithm made predictions by comparing patients to a database of six million similar cases.
But the lawsuit claims these comparisons were inherently flawed. Investigative reporting revealed that the model didn’t account for comorbidities and individual patient complexity that physicians recognized as critical to recovery — instead producing generic recommendations that failed to capture the nuances of individual patient needs.
What’s worse, according to former employees, the system’s failures went beyond incomplete data. UnitedHealthcare instructed case managers not to deviate from the AI model’s predictions and held them to performance targets within 1% of the algorithm’s predicted lengths of stay. The lawsuit claims that doctors treating the patients weren’t informed about the AI’s determinations, and when they inquired, UnitedHealthcare employees said the information was proprietary.
What went wrong?
Even with accurate data, incomplete context leads to systematic failures — and without monitoring outputs against ground truth, you won’t catch it.
This case demonstrates the danger of deploying AI without proper feedback loops. The model made predictions based on incomplete patient context, with no apparent mechanism connecting the 90% appeal reversal rate back to performance monitoring. The system lacked transparency, preventing physicians from understanding or challenging the AI’s reasoning.
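What would a feedback loop have looked like here? As a hedged sketch (the data structures and 10% tolerance are assumptions, not details from the case), you can treat appeal outcomes as ground truth and pull the model out of the loop when reversals climb:

```python
from dataclasses import dataclass

@dataclass
class DenialOutcome:
    claim_id: str
    model_denied: bool        # the model recommended denying coverage
    appealed: bool
    reversed_on_appeal: bool  # a human reviewer overturned the denial

def appeal_reversal_rate(outcomes: list[DenialOutcome]) -> float:
    """Share of appealed denials that were reversed -- a direct ground-truth signal."""
    appealed = [o for o in outcomes if o.model_denied and o.appealed]
    if not appealed:
        return 0.0
    return sum(o.reversed_on_appeal for o in appealed) / len(appealed)

def should_pause_model(outcomes: list[DenialOutcome], tolerance: float = 0.10) -> bool:
    """Escalate for human review once reversals exceed the tolerance.

    A reversal rate anywhere near the 90% alleged in the lawsuit would trip
    this check almost immediately.
    """
    return appeal_reversal_rate(outcomes) > tolerance
```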
3. Taco Bell: 18,000 Waters Ordered from Voice AI
For most organizations, agents represent an opportunity to improve efficiency. Many fast food restaurants have already replaced their indoor order counters with digital kiosks to free up labor at the front of the house. Taco Bell took that idea one step further by leveraging AI chatbots to digitize its drive-thru operations as well.
Unfortunately, using a deterministic digital ordering program to move structured data from point A to point B isn’t the same as deploying a non-deterministic chatbot to interpret unstructured speech and deliver it to that same program: a lesson Taco Bell learned the hard way.
One customer crashed the system by ordering 18,000 water cups. Another was trapped in a conversation loop while trying to order a drink. It was a headline-making failure.
These errors have since gone viral across social media, with one clip showing a man ordering “a large Mountain Dew” and the AI voice repeatedly replying “and what will you drink with that?”
What went wrong?
You can’t deploy AI to production without the right mix of production-capable tooling and process to maintain it. Issues like this don’t occur just because a chatbot goes rogue. They happen as a result of insufficient context engineering, prompting, and parameter setting during model development, and they persist because context monitoring through observability hasn’t been established to detect them.
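As an illustration of the kind of guardrail that catches this, here’s a minimal sketch: before the voice model’s interpretation of an order is committed, validate it against hard, deterministic business limits. The limits and function below are made up for the example, not Taco Bell’s actual rules.

```python
MAX_QUANTITY_PER_ITEM = 25   # illustrative business limit
MAX_ITEMS_PER_ORDER = 50

def validate_order(order: dict[str, int]) -> tuple[bool, str]:
    """Deterministic sanity check applied to whatever the voice model heard.

    The model can mishear or be trolled; the validator doesn't care why --
    it simply refuses to pass an implausible order downstream.
    """
    total_items = sum(order.values())
    if total_items > MAX_ITEMS_PER_ORDER:
        return False, f"Order of {total_items} items exceeds limit; route to a human."
    for item, qty in order.items():
        if qty > MAX_QUANTITY_PER_ITEM:
            return False, f"{qty} x {item} looks implausible; route to a human."
    return True, "ok"

# validate_order({"water cup": 18000}) -> (False, "18000 x water cup looks implausible; ...")
```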
4. Chevy for $1
Prompt engineering will be key for the future of AI. But what happens when your chatbot starts taking prompt engineering commands from customers?
In one of the most talked-about AI reliability incidents, a user of a GM dealership’s chatbot used some sneaky prompts to get the bot to agree to sell them a 2024 Chevy Tahoe for just $1.

“We certainly appreciate how chatbots can offer answers that create interest when given a variety of prompts, but it’s also a good reminder of the importance of human intelligence and analysis with AI-generated content,” said a Chevy spokesperson.
What went wrong?
Agents are just as susceptible to bad inputs as any traditional data product, but in the case of AI, those inputs don’t always come from your own pipelines.
A lack of foundational standards and the natural limitations of testing environments meant that a model that operated as directed in testing quickly became vulnerable at production scale.
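One mitigation is an output-side guardrail that reviews the bot’s reply before the customer ever sees it. The sketch below is hypothetical (the price floor and phrase patterns are illustrative), but it shows the idea: deterministic checks sit between a non-deterministic model and anything that sounds like a commitment.

```python
import re

MIN_PLAUSIBLE_PRICE = 1000  # illustrative floor for any vehicle quote
COMMITMENT_PATTERNS = [
    r"legally binding",
    r"no takesies?\s*backsies",
    r"that'?s a deal",
]

def review_bot_reply(reply: str) -> str:
    """Post-process model output before it reaches the customer.

    If the reply quotes an implausible price or uses commitment language,
    swap it for a safe handoff instead of sending it verbatim.
    """
    prices = [
        float(p.replace(",", ""))
        for p in re.findall(r"\$\s*(\d[\d,]*(?:\.\d+)?)", reply)
    ]
    quotes_too_low = any(p < MIN_PLAUSIBLE_PRICE for p in prices)
    makes_commitment = any(re.search(pat, reply, re.IGNORECASE) for pat in COMMITMENT_PATTERNS)
    if quotes_too_low or makes_commitment:
        return "I can't confirm pricing here -- let me connect you with a sales associate."
    return reply
```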
BONUS: Citibank and the $6 Billion Copy-Paste Mistake
After a string of high-profile lawsuits in 2024 brought on by poor data management, a basic data entry error nearly sent $6 billion to a Citigroup wealth account.
In April 2024, a Citigroup employee copied and pasted an account number into the field meant for the dollar amount for a transfer — over a thousand times the intended sum. The near-miss was caught the next business day.
The same month, another part of the bank accidentally credited $81 trillion to a different client. These errors occurred while Citi was already under regulatory penalties for poor systems and controls, and in the midst of a costly “transformation” effort to overhaul operations.
Now to be fair, this wasn’t an AI system, but imagine for a second that it had been. What could have happened? The issue might not have been caught until the checks bounced. The problem is that errors like this are just as likely to make their way into siloed and interdependent AI systems as they are into your most basic data products.
At their heart, AI products are data products. This incident (AI or not) demonstrates the enormous risk of low-quality data in the wild. And if we don’t have strong safeguards in place for the basics, how can we ever expect to deploy AI agents in a production environment safely?
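For what it’s worth, the same reliability loop applies to plain old data entry. A hypothetical pre-commit check like the one below (the ratio and hard limit are illustrative, not Citi’s actual controls) would flag an account number pasted into an amount field before anything posts:

```python
def transfer_issues(
    amount: float,
    instructed_amount: float,
    hard_limit: float = 1_000_000_000.0,  # illustrative ceiling for a single transfer
) -> list[str]:
    """Flag transfers that are wildly out of line before they post."""
    issues = []
    if amount > hard_limit:
        issues.append(f"Amount {amount:,.2f} exceeds hard limit {hard_limit:,.2f}")
    if instructed_amount > 0 and amount / instructed_amount > 100:
        issues.append("Amount is more than 100x the instructed sum; likely a field mix-up")
    return issues

# Illustrative figures only:
# transfer_issues(amount=6_000_000_000.0, instructed_amount=5_000_000.0) trips both checks.
```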
The Real Barrier to AI Adoption
There’s plenty of AI activity happening, but few real-world business applications reach production, and even fewer succeed once they get there. Business users don’t trust the outputs. And rightfully so.
According to a now-infamous MIT study, 95% of generative AI pilots at companies are failing to deliver measurable business impact. Not because of model quality, but due to flawed enterprise integration, incomplete data context, and lack of visibility into what’s actually happening in production.
Each of these five AI failures shares a common thread: the absence of comprehensive reliability practices throughout the development lifecycle. Whether it’s inadequate validation of data inputs, poor security controls, missing feedback loops, or lack of monitoring, the root cause is the same.
Every data and AI product—whether it’s a dashboard or a chatbot—needs a basic reliability loop to detect and resolve issues efficiently in production:
- Detect: Proactively monitor pipelines and data for anomalies and governance breaks.
- Triage: Assess the severity, scope, ownership, and required actions of incidents.
- Resolve: Rapidly address data, system, code, or model failures.
And that only happens when you treat your data and AI like a product, with the right mix of tooling and process to ensure reliable value delivery in production.
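As a rough sketch of how that loop might be wired up (the severity rules and handlers below are placeholders, not a prescribed implementation):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Incident:
    source: str        # e.g. "rag_pipeline", "orders_table", "chatbot_output"
    description: str
    severity: str = "unclassified"

def detect(metrics: dict[str, float], thresholds: dict[str, float]) -> list[Incident]:
    """Detect: compare monitored metrics (freshness, volume, null rates...) against thresholds."""
    return [
        Incident(source=name, description=f"{name}={value} breached threshold {thresholds[name]}")
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]

def triage(incident: Incident) -> Incident:
    """Triage: assign severity and, in a real system, an owner and a runbook."""
    incident.severity = "high" if "output" in incident.source else "medium"
    return incident

def resolve(incident: Incident, handlers: dict[str, Callable[[Incident], None]]) -> None:
    """Resolve: route the incident to whoever owns the failing data, system, code, or model."""
    handlers.get(incident.severity, lambda i: print(f"Unhandled incident: {i}"))(incident)

# One pass through the loop with illustrative metrics:
for inc in detect({"rag_pipeline_null_rate": 0.42}, {"rag_pipeline_null_rate": 0.25}):
    resolve(triage(inc), {"medium": lambda i: print(f"Ticket filed: {i.description}")})
```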
Building Reliability Into AI Development
That’s where data + AI observability comes in.
From the training and context data that feeds AI to the orchestration systems managing workflows and the model outputs a user consumes, a platform that unites data observability and AI observability provides end-to-end visibility into the health of both your data and AI in a single system.
Observability that’s built for production AI will:
- Monitor data quality in RAG pipelines.
- Track when context becomes incomplete or stale (a freshness check is sketched after this list).
- Validate outputs systematically rather than hoping for the best.
- Establish feedback loops that connect model behavior back to data quality and business outcomes.
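To ground the “stale context” item in that list, here’s a hedged sketch of a freshness check over the documents behind a RAG index. The metadata fields and the 30-day cutoff are assumptions for illustration, not a specific product’s API.

```python
from datetime import datetime, timedelta, timezone

MAX_CONTEXT_AGE = timedelta(days=30)  # illustrative freshness SLA for retrieved context

def stale_documents(doc_metadata: list[dict], now: datetime | None = None) -> list[str]:
    """Return IDs of indexed documents whose source hasn't been refreshed recently.

    Each dict is assumed to carry 'doc_id' and 'last_refreshed_at' (a UTC datetime),
    written by whatever job loads documents into the vector store.
    """
    now = now or datetime.now(timezone.utc)
    return [
        doc["doc_id"]
        for doc in doc_metadata
        if now - doc["last_refreshed_at"] > MAX_CONTEXT_AGE
    ]

def context_is_healthy(doc_metadata: list[dict], max_stale_fraction: float = 0.10) -> bool:
    """Alert (or block retrieval) when too much of the index has gone stale."""
    if not doc_metadata:
        return False
    return len(stale_documents(doc_metadata)) / len(doc_metadata) <= max_stale_fraction
```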
And modern observability goes beyond passive monitoring. Observability agents can actively participate in your development workflow, automatically detecting anomalies, accelerating root cause analysis, and suggesting remediation steps. Agent observability helps you understand what your AI systems are doing, why they’re doing it, and whether those actions align with intended behavior.
The question isn’t whether you’ll encounter reliability issues in your AI systems. You will. The question is whether you have the right frameworks in place to catch and resolve them before they make headlines, or worse.
Building AI systems that stakeholders actually trust requires treating reliability as a day-zero concern, and deploying the right systems and protocols to manage it that way.