Has AI-Assisted Coding Made Data Quality Better or Worse?
Code breaks data. At least it used to.
Data teams write SQL transformations to shape raw data for downstream use cases. When those queries change, they can rupture dependencies or alter metrics in unintended ways.
But data engineers don’t write SQL queries alone anymore. According to a 2025 dbt survey, 70% of respondents use AI for analytics development.

Source: 2025 State Of Analytics Engineering Report
This shift led us to investigate a simple question:
Are code-based data quality incidents becoming a thing of the past?
Methodology
We analyzed 1,000 troubleshooting investigations from the past month across hundreds of customer environments. Using an LLM-assisted clustering approach combined with manual review, we categorized root causes into several classes including data source issues, system failures, and code changes.
From that analysis, roughly 10% of data quality issues traced back to code changes.

It’s important to note we are not claiming this to be a fully scientific process. The rate at which specific issues are found is dependent on the extent of customer integrations, and it is likely this is an underrepresentation.
But it’s an informative data point nonetheless, especially when you consider just how good Claude and other LLMs are getting at generating code.
How has everything changed so much, and yet changed so little? In short:
AI has helped reduce syntax-level failures in data pipelines. But most data quality incidents were never caused by broken SQL. They come from broken assumptions between systems.
Let’s dive into tales from the traces and, ultimately, what teams can do to solve this problem.
Syntax issues are largely extinct

Just last year we were still seeing frequent instances of queries failing from simple human error: for example, a missing semicolon at the end of a statement, or a metric that accidentally divided by 0.
These code-based data quality issues still happen today, but they have diminished considerably with AI-assisted coding practices.
Now, if you ask an LLM to “write a SQL query for conversion_rate = conversions / visits,” it will almost always guard against divide-by-zero errors by wrapping the denominator in a NULLIF function.
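Here is a minimal sketch of that guard, run through Python's built-in sqlite3 module so it works anywhere. The campaign_stats table and its rows are hypothetical, made up for illustration.

```python
import sqlite3

# Hypothetical table and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE campaign_stats (campaign TEXT, conversions INTEGER, visits INTEGER)"
)
conn.executemany(
    "INSERT INTO campaign_stats VALUES (?, ?, ?)",
    [("spring_sale", 5, 100), ("not_launched_yet", 0, 0)],
)

# NULLIF(visits, 0) returns NULL when visits is 0, so the rate becomes NULL
# rather than a failed query. SQLite happens to return NULL on division by
# zero anyway; many warehouses raise an error instead, which is the case
# the guard protects against.
rows = conn.execute(
    """
    SELECT campaign,
           conversions * 1.0 / NULLIF(visits, 0) AS conversion_rate
    FROM campaign_stats
    ORDER BY campaign
    """
).fetchall()
conversion_rates = dict(rows)
```

A NULL rate still needs a decision downstream (show it as blank, or treat it as zero), but it no longer takes the whole query down with it.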
But schema changes are still problematic
AI-assisted software engineers are shipping up to 60% more code to production.
These upstream applications evolve independently from data + AI systems, and software engineers still pay as much attention to their data exhaust as they always have (which is to say not much).
The result is a data ecosystem where schema volatility is increasing, even as query generation becomes easier.
Here are a couple of anonymized examples of schema changes breaking hardcoded data pipelines from our analysis:
- Advertising campaigns were missing the standard country identifier pattern in their campaign names (e.g., “UK_enterprise_campaign”). Downstream query logic parses country codes from those campaign name patterns. In this case, those fields were set to NULL, resulting in further query and join failures in the dimensional model.
- A new Salesforce value (se_role) introduced a new column in a shared view that downstream transformations depended on. Joins depending on a specific schema layout began returning unexpected results, and dashboards built on those models showed shifts in segmentation metrics.
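One way to make the first failure loud instead of silent is to separate names that match the expected pattern from names that don't, before anything gets set to NULL. The sketch below assumes the convention is a two-letter uppercase country code followed by an underscore. That regex and the function name are our own illustration, not the customer's actual logic.

```python
import re

# Assumed convention: names start with a two-letter uppercase country code
# and an underscore, e.g. "UK_enterprise_campaign".
COUNTRY_PREFIX = re.compile(r"^([A-Z]{2})_")


def split_by_country_pattern(campaign_names):
    """Return (parsed, unparsed).

    parsed maps each conforming name to its country code; unparsed lists
    the names that break the convention so they can be reviewed or alerted on.
    """
    parsed, unparsed = {}, []
    for name in campaign_names:
        match = COUNTRY_PREFIX.match(name)
        if match:
            parsed[name] = match.group(1)
        else:
            unparsed.append(name)
    return parsed, unparsed


parsed, unparsed = split_by_country_pattern(
    ["UK_enterprise_campaign", "enterprise_campaign_q3"]
)
```

Anything that lands in unparsed is a candidate for an alert or a quarantine table rather than a NULL that quietly flows into joins.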
Semantic drift: new logic breaks old assumptions
Upstream changes aren’t constrained to changing campaign and column names. When upstream logic changes meet old assumptions, data quality issues abound.
In our analysis, we saw an example where an upstream product team changed how “active users” were defined, but downstream models continued using the old definition.
Previously, a user was classified as active if they had an active subscription record with status = ‘active’. The product team updated the logic so that users in a grace_period or trial state were also treated as active for product access.
However, downstream data models had hardcoded assumptions based on the original definition. One model calculated membership tiers and revenue segments using SQL like:
SELECT
    user_id,
    CASE
        WHEN status = 'active' THEN 'paid_member'
        ELSE 'inactive'
    END AS membership_tier
FROM subscription_status;
Once the upstream service began marking grace_period and trial users as active in its view, the downstream model did not recognize those new states. As a result, users in those states were incorrectly categorized as inactive, and key metrics were wrong.
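A more defensive version of that CASE expression maps every known state explicitly and sends anything else to a visible bucket instead of a catch-all ELSE 'inactive'. The sketch below is one possible shape, not the customer's fix: the 'active_unpaid' and 'unmapped_status' labels and the 'cancelled' and 'paused' states are hypothetical, and how grace_period and trial users should actually be tiered is a business decision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscription_status (user_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO subscription_status VALUES (?, ?)",
    [
        (1, "active"),
        (2, "grace_period"),
        (3, "trial"),
        (4, "cancelled"),  # hypothetical state
        (5, "paused"),     # hypothetical state the model has never seen
    ],
)

# Every known state is mapped on purpose; unknown states surface as
# 'unmapped_status' instead of silently becoming 'inactive'.
rows = conn.execute(
    """
    SELECT user_id,
           CASE
               WHEN status = 'active' THEN 'paid_member'
               WHEN status IN ('grace_period', 'trial') THEN 'active_unpaid'
               WHEN status = 'cancelled' THEN 'inactive'
               ELSE 'unmapped_status'
           END AS membership_tier
    FROM subscription_status
    ORDER BY user_id
    """
).fetchall()
tiers = dict(rows)

# Users whose status the model does not know how to classify.
unmapped = [user_id for user_id, tier in rows if tier == "unmapped_status"]
```

A non-empty unmapped list is the signal that an upstream definition has moved and the model needs a human decision.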
AI can’t help logic mistakes
Sometimes it’s not just new logic breaking old assumptions, but data professionals making incorrect assumptions or other seemingly innocuous mistakes.
In one example from our analysis, a pipeline used a unique key to determine whether rows should be inserted or updated.
The SQL compiled successfully and the job completed as expected. But the merge condition did not fully capture all fields that defined a unique customer record. When new records arrived that differed slightly from existing ones, the merge logic treated them as new rows rather than updates.

Over time this created duplicate records in what was expected to be a deduplicated table.
This is another class of problem that AI-assisted coding does little to prevent. The SQL was syntactically correct — the mistake was in the logic used to identify and merge records.
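A cheap safety net for this class of bug is a uniqueness test on whatever the team considers the real business key, run after the merge. The sketch below assumes that key is (customer_id, region). Both that key and the table are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, region TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("c1", "EU", "a@example.com"),
        ("c1", "EU", "a.new@example.com"),  # should have been an update
        ("c2", "US", "b@example.com"),
    ],
)

# Any business key that appears more than once means the merge inserted
# a row it should have updated.
duplicates = conn.execute(
    """
    SELECT customer_id, region, COUNT(*) AS row_count
    FROM customers
    GROUP BY customer_id, region
    HAVING COUNT(*) > 1
    """
).fetchall()
```

The check doesn't say why the merge misfired, only that the table is no longer deduplicated, which is usually enough to catch the problem before it compounds.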
Time windows are tricky
In our analysis, we saw several examples where pipelines applied incorrect assumptions about how records arrive over time. One downstream model calculated daily investment activity.
To reduce processing time, the pipeline only loaded records that had been updated since the last run. The assumption was that any new transactions or corrections would appear with a more recent updated_at timestamp.
In practice, the upstream system occasionally produced late-arriving adjustments or backfills. Because the incremental filter relied on updated_at, those corrections fell outside the pipeline’s processing window and were never ingested into the analytics model.
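A common mitigation is a lookback window: reprocess a little further back than the last run so late arrivals are still picked up. The sketch below compares a strict filter with a lookback filter. The records, the last-run timestamp, and the two-day tolerance are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical records: (transaction_id, updated_at).
records = [
    ("t1", datetime(2025, 1, 10, 9, 0)),
    ("t2", datetime(2025, 1, 11, 8, 0)),
    # Backfilled correction: lands after the last run but carries an
    # older timestamp.
    ("t3_adjustment", datetime(2025, 1, 9, 23, 0)),
]

last_run = datetime(2025, 1, 10, 12, 0)
lookback = timedelta(days=2)  # assumed tolerance for late arrivals


def select_incremental(records, since):
    """Return IDs of records whose updated_at is strictly after `since`."""
    return [record_id for record_id, updated_at in records if updated_at > since]


# Strict filter: only rows newer than the last run.
strict = select_incremental(records, last_run)

# Lookback filter: also re-reads the two days before the last run.
with_lookback = select_incremental(records, last_run - lookback)
```

The lookback version re-reads rows that were already loaded (t1 here), so the load step has to be idempotent, for example an upsert keyed on the transaction ID.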

We also saw many examples involving slowly changing dimension (SCD) patterns. In these models, an entity like a customer ID may appear multiple times as its attributes change over time—for example, when a user upgrades or downgrades their subscription. The table typically includes metadata, like effective dates or a flag indicating which row represents the current version.

When late-arriving updates or other logic mismatches occur, missing records or duplicate entries can result, even though the SQL generated by AI was syntactically correct.
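For SCD tables, one simple invariant to test is that every entity has exactly one row flagged as current. The sketch below assumes an is_current flag stored as 0 or 1. The customer_dim table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim "
    "(customer_id TEXT, plan TEXT, effective_from TEXT, is_current INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_dim VALUES (?, ?, ?, ?)",
    [
        ("c1", "basic", "2025-01-01", 0),
        ("c1", "pro", "2025-03-01", 1),    # healthy: one current row
        ("c2", "basic", "2025-01-01", 1),
        ("c2", "pro", "2025-02-01", 1),    # two current rows
        ("c3", "basic", "2025-01-01", 0),  # no current row
    ],
)

# Entities whose count of current rows is anything other than exactly one.
violations = conn.execute(
    """
    SELECT customer_id, SUM(is_current) AS current_rows
    FROM customer_dim
    GROUP BY customer_id
    HAVING SUM(is_current) <> 1
    ORDER BY customer_id
    """
).fetchall()
```

Two current rows point to a duplicate waiting to happen in any join on the current version. Zero current rows point to a record that will go missing.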
Against the grain
In another example from our analysis, a transformation to add user details assumed both tables were at the same grain (for example, one row per user). But the dimension table actually contained multiple rows for the same user. This caused the join to duplicate records in the resulting table, inflating viewership numbers downstream.
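A quick way to catch this kind of fan-out is to compare row counts before and after the join: an enrichment join should never return more rows than the fact table had going in. The tables below are hypothetical stand-ins for the viewership data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE view_events (view_id INTEGER, user_id TEXT);
    CREATE TABLE user_dim (user_id TEXT, device TEXT);
    INSERT INTO view_events VALUES (1, 'u1'), (2, 'u1'), (3, 'u2');
    -- u1 appears twice, so user_dim is not one row per user.
    INSERT INTO user_dim VALUES ('u1', 'phone'), ('u1', 'tablet'), ('u2', 'phone');
    """
)

fact_rows = conn.execute("SELECT COUNT(*) FROM view_events").fetchone()[0]
joined_rows = conn.execute(
    """
    SELECT COUNT(*)
    FROM view_events v
    JOIN user_dim d ON v.user_id = d.user_id
    """
).fetchone()[0]

# More rows after the join than before means the dimension is at a finer
# grain than the join assumed.
fan_out = joined_rows > fact_rows
```

The same idea works as a uniqueness test on the dimension's join key, which catches the problem before the join runs at all.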
And the occasional hallucination
AI coding has gotten more effective than ever in the last few months, but let’s not forget there is a reason why every LLM has a disclaimer at the bottom that it’s “AI and may make mistakes.”
So what do we do about it?
Syntax issues and simple human errors no longer create as many data quality issues as they did just three years ago, and that is cause for celebration for anyone who appreciates a good batch of high-quality data.

But bad data is still inevitable, as is bad data caused by query changes and failures. All is not lost: there are some easy best practices data and AI leaders can implement today:
- Data contracts: We’ve spoken about the advantages and disadvantages at length. But the quick summary is that schema changes for your most important pipelines can be avoided through proactive communication with your internal data providers.
- Data diff: Many teams analyze the output of new queries before promoting them to production, a process often called data diffing. This can catch those unexpected changes downstream, but for enterprise environments these analyses end up being a wall of very dense information, making it hard to separate signal from the noise. All I will say is: watch this space.
- Data + AI observability: The reality is that no system will be 100% reliable, just as no security system is hacker-proof. Teams need a platform and systemic approach for identifying and quickly responding to incidents in production data + AI systems when they occur. There is a reason that Gartner says 96% of data + AI leaders have or plan to adopt data observability in the next year.
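To make the first of these concrete, a data contract can start as nothing more than a list of the columns and types a downstream model relies on, checked against what the provider actually ships. The sketch below is one minimal way to do that with SQLite's table metadata. The CONTRACT dictionary and the shared_accounts table are hypothetical, and real contracts usually cover much more (nullability, allowed values, freshness).

```python
import sqlite3

# Hypothetical contract: columns and declared types a downstream model relies on.
CONTRACT = {"account_id": "TEXT", "segment": "TEXT"}

conn = sqlite3.connect(":memory:")
# The provider has added se_role, a column the contract does not know about.
conn.execute(
    "CREATE TABLE shared_accounts (account_id TEXT, segment TEXT, se_role TEXT)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
actual = {
    row[1]: row[2] for row in conn.execute("PRAGMA table_info(shared_accounts)")
}

missing = sorted(set(CONTRACT) - set(actual))
unexpected = sorted(set(actual) - set(CONTRACT))
type_mismatches = sorted(
    name for name in CONTRACT if name in actual and actual[name] != CONTRACT[name]
)
```

An unexpected column is not necessarily a breakage, but it is the conversation starter the contract exists to force.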
AI didn’t eliminate data quality problems. It simply reduced the easy ones.
The remaining failures — semantic drift, schema volatility, and system assumptions — are harder, subtler, and increasingly common.
And in a world where AI systems depend on reliable data, the cost of getting them wrong is higher than ever.