
What is Data Reliability and How do I Make my Data Reliable?

By Barr Moses

Over the past several years, I’ve spoken with over 150 data leaders about the reliability of their data, ranging from a few null values to wholly inaccurate data sets. Their individual issues ran the gamut, but one thing was clear: there was more at stake than a few missing data points.

One VP of Engineering told me that before his team started monitoring for data downtime, their entire database of customer information was 8 hours off, revealing massive tech debt. Making matters worse, they didn’t catch the issue for several months, only identifying it during a data warehouse migration. While it ended up being a relatively simple fix (and an embarrassing discovery), it went undiscovered for far too long—and their revenue suffered in the process.

This story is a common one — and no company is spared.

In this blog, we’ll take a deep dive into the building blocks of reliable data to understand what it is, where it falters, and what you can do to improve your data reliability at any scale.

What is data reliability?

Data reliability is the degree to which data remains accurate, complete, and consistent over time and across various conditions. That temporal element makes all the difference. Plenty of datasets look perfect during a quarterly audit but fall apart in daily use. A pipeline breaks, a source system changes, or volume spikes, and suddenly your pristine data becomes a mess. Reliable data stays trustworthy through all of that.

The difference between quality and reliability comes down to consistency. Your customer database might be spotless after a weekend cleanup project. Every field validated, every duplicate removed. But what happens Tuesday when three different systems start updating it? If errors creep back in throughout the week, you’ve got a quality moment, not reliable data. Reliability means maintaining those high standards continuously as new data flows in and conditions change.

Reliable data demonstrates several key characteristics that work together. Accuracy keeps values correct and error-free. Completeness ensures nothing important goes missing. Consistency maintains expected formats and rules across your entire ecosystem. Timeliness delivers fresh data exactly when teams need it. Miss any of these and reliability breaks down. They’re interconnected requirements, not a checklist you can partially complete.

How DevOps informed the modern data reliability movement

As data professionals, we can learn a lot from software engineering when it comes to building robust, highly available systems. In a previous article, I discussed why data reliability is a must-have for data teams, and here, I share how we can apply this concept in practice through engineering operations.

Coined by Google SVP Benjamin Treynor Sloss in the early 2000s, Site Reliability Engineering, a subset of DevOps, refers to “what happens when you ask a software engineer to design an operations function.” In other words, site reliability engineers (SREs for short) build automated software to optimize application uptime while minimizing toil and reducing downtime. On top of these duties, SREs are known as the “firefighters” of the engineering world, working to address hidden bugs, laggy applications, and system outages.

However, while firefighting is certainly a core responsibility, SREs are also charged with finding new ways to thoughtfully manage risk by understanding the opportunity cost of new features and other innovations. To drive this data-driven decision making, SREs establish clear Service Level Objectives (SLOs) that define what reliability looks like in the real world, measured by Service Level Indicators (SLIs).

Once SLOs and SLIs (say that 10 times fast…) are established, SREs can more easily strike the balance between reliability and risk. Even with the smartest solutions and the most experienced SREs on tap, achieving 100% system uptime is all but impossible. Innovation relies on iteration, and the only way to eliminate downtime entirely is to stay static, but that doesn’t give you a competitive edge. As one of my SRE friends aptly noted: “it’s not a matter of if the site will go down, it’s a matter of when.”

Firefighting isn’t just for SREs. As data professionals, we also deal with our fair share of data downtime fires, but we don’t have to. Image courtesy of Jay Heike on Unsplash.

Now, as data systems reach similar levels of complexity and higher levels of importance in an organization, we can apply these same concepts to our field as data reliability — an organization’s ability to deliver high data availability and health throughout the entire data life cycle.

In the same way we meticulously manage application downtime, we must focus on reducing data downtime — periods of time when data is inaccurate, missing, or otherwise erroneous.

Common challenges of data reliability

Major companies like GitHub, IBM, DoorDash, and Slack have all suffered high-profile application outages. Data downtime poses an equally serious threat, yet often flies under the radar until the damage is done.

Data downtime stems from issues across your entire data ecosystem. At the data level, information becomes stale, inaccurate, duplicated, or incomplete. At the system level, pipelines fail to scale or break under unexpected loads. At the code level, transformations introduce errors or fail to handle edge cases. The result is the same regardless of the cause. Bad data flows downstream, decisions get made on faulty information, and trust erodes.

Each challenge requires specific strategies to detect, prevent, and resolve. But as your data ecosystem grows, these challenges multiply and interact. Managing data reliability at scale demands more than ad-hoc fixes and manual checks. It requires systematic approaches, automated monitoring, and a culture that treats data incidents with the same urgency as application downtime.

Unexpected schema changes

Upstream sources change schemas without warning, causing pipelines to fail or silently corrupt data. A renamed field, a new mandatory column, or a data type switch can break months of stable operation. Without proper contracts and monitoring between teams, these changes cascade through your data ecosystem causing outages and requiring emergency fixes.
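
To make that concrete, here’s a minimal sketch of a pre-load schema check; the expected schema, table, column names, and types below are entirely hypothetical, and in practice this logic usually lives in a data contract or an observability tool rather than a hand-rolled script.

```python
# Minimal sketch: catch unexpected schema changes before a batch is loaded.
# EXPECTED_SCHEMA is a hypothetical, hand-maintained contract for one table.
EXPECTED_SCHEMA = {
    "order_id": "bigint",
    "customer_id": "bigint",
    "order_total": "numeric",
    "created_at": "timestamp",
}

def check_schema(observed: dict) -> list:
    """Return human-readable violations of the expected schema."""
    problems = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in observed:
            problems.append(f"missing column: {column}")
        elif observed[column] != expected_type:
            problems.append(f"type change on {column}: expected {expected_type}, got {observed[column]}")
    for column in observed:
        if column not in EXPECTED_SCHEMA:
            problems.append(f"unexpected new column: {column}")
    return problems

# A renamed field ("order_total" -> "total") and a type switch both surface here,
# before the bad batch ever reaches downstream tables.
violations = check_schema({"order_id": "bigint", "customer_id": "varchar",
                           "total": "numeric", "created_at": "timestamp"})
if violations:
    raise RuntimeError("Schema contract violated: " + "; ".join(violations))
```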

Data drift and anomalies

Data characteristics change over time as business evolves, creating a challenge in distinguishing legitimate changes from quality issues. Seasonal patterns, new user behaviors, or process changes can trigger false alerts while real anomalies hide in the noise. Teams need intelligent monitoring that learns normal variations while catching true problems before they impact decisions.
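
One lightweight way to separate legitimate change from a real problem is to compare today’s value of a metric against a rolling baseline rather than a fixed threshold. The sketch below uses a simple z-score test on hypothetical row counts; dedicated monitoring tools learn these baselines (including seasonality) automatically, so treat this as an illustration of the idea, not a recipe.

```python
import statistics

def is_anomalous(history: list, today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's value only if it falls well outside the recent baseline."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Hypothetical daily row counts for one table: normal day-to-day variation is
# tolerated, but a sudden collapse in volume is flagged.
row_counts = [102_000, 104_500, 101_800, 103_200, 105_000, 104_100, 103_700]
print(is_anomalous(row_counts, 104_900))  # False: within normal variation
print(is_anomalous(row_counts, 41_000))   # True: likely a broken upstream load
```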

Pipeline scalability issues

Pipelines designed for startup data volumes often fail catastrophically at enterprise scale. That Python script processing 10,000 records breaks when faced with millions, causing delays, crashes, and missed SLAs. Performance degradation happens gradually then suddenly, turning reliable pipelines into bottlenecks that corrupt data and break trust.
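
A common culprit is code that materializes the entire dataset in memory before doing any work. The hypothetical example below contrasts that pattern with a streaming version of the same aggregation; the file path and column name are made up.

```python
import csv

def total_revenue_naive(path: str) -> float:
    # Works at 10,000 rows, struggles at tens of millions: the entire file is
    # materialized in memory before any work happens.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return sum(float(row["amount"]) for row in rows)

def total_revenue_streaming(path: str) -> float:
    # Same result, but rows are processed one at a time, so memory use stays
    # flat as data volume grows.
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["amount"])
    return total
```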

Lack of data ownership and accountability

Critical pipelines run without clear owners, turning minor issues into major incidents. When pipelines fail, teams waste precious time figuring out who should respond and how the system works. Without defined ownership and maintenance plans, technical debt accumulates and reliability degrades until inevitable failures force emergency interventions.

Insufficient monitoring and alerting

Teams often discover data problems only after users complain, by which time bad data has already influenced decisions. Without automated monitoring that tracks data quality metrics and pipeline health, issues compound for days or weeks. Reactive approaches create cycles where problems multiply faster than teams can fix them, eroding trust across the organization.
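
As a rough illustration, a scheduled health check like the sketch below surfaces stale or missing data before a stakeholder does. The table name, thresholds, and the `run_query`/`send_alert` helpers are stand-ins for whatever warehouse client and alerting hook you already use.

```python
from datetime import datetime, timedelta, timezone

def run_query(sql: str):
    # Stand-in for your warehouse client; returns (last_loaded_at, todays_row_count).
    return datetime.now(timezone.utc) - timedelta(hours=9), 0

def send_alert(message: str):
    # Stand-in for a Slack / PagerDuty / email hook.
    print("ALERT:", message)

def check_table_health(table: str, max_staleness_hours: int = 6, min_rows: int = 1):
    last_loaded, row_count = run_query(
        f"SELECT MAX(loaded_at), COUNT(*) FROM {table} WHERE loaded_at >= CURRENT_DATE"
    )
    if datetime.now(timezone.utc) - last_loaded > timedelta(hours=max_staleness_hours):
        send_alert(f"{table} is stale: last successful load more than {max_staleness_hours}h ago")
    if row_count < min_rows:
        send_alert(f"{table} received only {row_count} rows today (expected at least {min_rows})")

check_table_health("analytics.orders")  # hypothetical table; run on a schedule, not on complaint
```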

Each of these issues needs to be monitored and resolved effectively to maintain a healthy and reliable pipeline. But the bigger your data gets, the greater that challenge becomes.

So, in order to detect and resolve bad data effectively in a scaling environment, data teams will need to rely on a modern approach to data quality management.

Data reliability requires modern data quality management

In the same way that SRE teams are the first to know about application crashes or performance issues, data teams should be the first to know about bad pipelines and data quality issues, too. Only six years ago, data loss and downtime cost companies a cumulative $1.7 trillion annually; in an age where data is ubiquitous and data management tools haven’t necessarily caught up, these numbers have likely gotten worse.

The challenge with traditional data reliability practices like data quality monitoring and data testing is that they don’t extend into the data, system, and code layers to identify why the data is broken, and they generally lack the automation required to make these practices scalable or effective long term. 

To avoid data downtime, it’s important to have full observability into your data throughout its entire lifecycle — with automated monitoring all the way from source to consumption. Strong pipelines lead to accurate and timely insights, which allows for better decision making, true governance, and happier customers.

Data observability extends the value of traditional data quality practices into the data, system, and code levels of your data estate to understand not just when but why data breaks, and marries that detection with resolution features like automated lineage to make those monitors more actionable.

So, now that we understand the tools you need, let’s consider who’s responsible.

Who is responsible for data reliability?

Sometimes a data platform team will manage data reliability as part of the larger environment; other times a data reliability engineer, a data observability team, or a data governance team will take responsibility as part of a specialized role. And in the case of a data mesh, oftentimes a domain team will be responsible for data reliability as part of managing their domain pipelines, with the help of infrastructure provided by a centralized platform team.

However, regardless of who is personally responsible for monitoring and fixing a given problem, data reliability is never the responsibility of a single person or team. Managing data reliability requires cultural change supported by scalable data quality management tooling.

Of course, while data reliability is in some sense the responsibility of the entire team, data reliability management will vary based on the structure of your data team.

How to measure data reliability

You don’t know what you don’t measure. But measuring data reliability goes beyond assessing completeness and accuracy at a specific point in time; you need to understand how your data reliability is trending over time.


The best way to measure data reliability is to measure the downtime of a given data asset or table. That means understanding how often that asset is unusable due to some inaccuracy in the data or break in the pipeline. 

Fortunately, the formula to calculate data downtime is relatively simple, but you’ll need to have a few pieces of information handy: the number of incidents over a given time period, how long it takes your team to detect the average incident, and how long it takes your team to resolve the average incident.

Once you have that information, you can follow the formula below:

Data downtime = number of incidents × (average time to detection + average time to resolution)
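
Plugging in some illustrative (entirely hypothetical) numbers makes the formula concrete:

```python
# Hypothetical quarter: 15 incidents, detected in ~4 hours and resolved in
# ~6 hours on average.
incidents = 15
avg_time_to_detection_hours = 4
avg_time_to_resolution_hours = 6

data_downtime_hours = incidents * (avg_time_to_detection_hours + avg_time_to_resolution_hours)
print(data_downtime_hours)  # 150 hours of data downtime for the quarter
```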

Other ways to measure data reliability more specifically include measuring qualitative metrics like consumer satisfaction or how often a data asset is unusable over a given time. To do this, focus on establishing expected levels of quality and service. 

Many data leaders tell us their data scientists and data engineers spend 30 percent or more of their time firefighting data downtime. That’s time that could be much better spent delivering new value for stakeholders.


So, it’s important to understand how much data downtime you have—and the cost of those broken pipelines over time. Check out the Data Quality ROI guide below for a deeper dive into measuring data downtime and calculating the dollar value of your data quality practices.

Data quality vs data reliability

Data quality focuses on the state of your data right now. It’s measured through six dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. These dimensions tell you whether your data is fit for use at a specific moment. Think of it as a snapshot.

Data reliability is about consistency over time. It asks whether your data maintains its quality standards continuously, across varying conditions and workloads. Data quality checks tell you if your data is correct today. Reliability ensures it stays correct tomorrow, next week, and during your busiest season.

An airline analogy helps clarify the distinction. An airline’s quality might look great today with 95% on-time flights, zero safety incidents, and decent food service. But reliability means maintaining those standards through winter storms, holiday rushes, and equipment changes. A reliable airline delivers consistent service whether it’s a Tuesday in March or Christmas Eve.

The same principle applies to data. Your customer database might be 100% complete and accurate at midnight. That’s quality. But if an upstream pipeline fails and by 9am your data is stale or missing, you have a reliability problem. A reliable data infrastructure prevents or catches that failure, ensuring your data remains accurate when your team needs it.

You can’t have reliable data without quality data. They’re interdependent. Some practitioners even frame reliability as sustained quality over time. While quality asks “Is this data correct?” reliability asks “Will this data be correct every time I need it?”

This shift in thinking has sparked a new discipline called Data Reliability Engineering (DRE). It applies Site Reliability Engineering principles to data pipelines. Teams now set Service Level Objectives for their data, like “99% of ETL jobs complete before 6am” or “data freshness stays under one hour.” They track these metrics just as rigorously as they monitor application uptime.
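
Tracking an SLO like “99% of ETL jobs complete before 6am” then reduces to comparing job history against the target. Here’s a minimal sketch with made-up job records and cutoff:

```python
from datetime import time

# Hypothetical job history for one day: (job_name, completion_time).
job_completions = [
    ("load_orders", time(4, 55)),
    ("load_customers", time(5, 20)),
    ("build_revenue_mart", time(6, 40)),   # missed the cutoff
    ("refresh_dashboards", time(5, 50)),
]

SLO_CUTOFF = time(6, 0)   # jobs should finish before 6am
SLO_TARGET = 0.99         # ...at least 99% of the time

on_time = sum(1 for _, finished_at in job_completions if finished_at < SLO_CUTOFF)
attainment = on_time / len(job_completions)
print(f"SLO attainment: {attainment:.0%} (target {SLO_TARGET:.0%})")  # 75% -- SLO breached
```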

The move toward DRE reflects how critical data has become. Organizations need data they can depend on, not just data that looks good during quarterly audits. That requires thinking about the processes and tools that keep data healthy continuously, not just point-in-time quality checks.

Data quality vs data reliability. Image courtesy of Shane Murray.

How do I make my data reliable?

I propose two primary ways data teams can achieve high data reliability at their organization: 1) set data SLOs and 2) invest in an automated solution that reduces data downtime.

Leverage a data observability platform to set SLOs and SLIs

Setting SLOs and SLIs for system reliability is an expected and necessary function of any SRE team, and in my opinion, it’s about time we applied them to data as well. Some companies are already doing just that.

In the context of data, SLOs refer to the target range of values a data team hopes to achieve across a given set of SLIs. What your SLOs look like will vary depending on demands of your organization and the needs of your customers. For instance, a B2B cloud storage company may have an SLO of 1 hour or less of downtime per 100 hours of uptime, while a ridesharing service will aim for as much uptime as humanly possible.

Here’s how to think about defining your data SLIs. In previous posts, I’ve discussed the five pillars of data observability. Reframed, these pillars are your five key data SLIs: freshness, distribution, volume, schema, and lineage (the sketch after this list shows how measured SLIs roll up against SLOs).

  • Freshness: Data freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated.
  • Distribution: Distribution, a function of your data’s possible values, tells you whether your data falls within an accepted range.
  • Volume: Volume refers to the completeness of your data tables and offers insights on the health of your data sources.
  • Schema: Schema changes often indicate broken data.
  • Lineage: When data does break, lineage tells you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it.
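
To show how these SLIs roll up against SLOs, here’s a minimal sketch that expresses a few targets declaratively and evaluates measured values against them; the table name, thresholds, and measurements are all hypothetical, and in practice an observability platform would do the measuring for you.

```python
# Hypothetical SLOs for one table, expressed declaratively. The measured SLI
# values passed to evaluate() would come from your warehouse or an
# observability platform.
SLOS = {
    "analytics.orders": {
        "max_freshness_lag_minutes": 60,    # freshness: no more than 1 hour behind
        "min_daily_rows": 50_000,           # volume: expected floor for a normal day
        "max_null_pct_order_total": 0.01,   # distribution: <= 1% nulls in a key field
    },
}

def evaluate(table: str, slis: dict) -> list:
    """Return the list of SLO breaches for a table given its measured SLIs."""
    slo = SLOS[table]
    breaches = []
    if slis["freshness_lag_minutes"] > slo["max_freshness_lag_minutes"]:
        breaches.append("freshness SLO breached")
    if slis["daily_rows"] < slo["min_daily_rows"]:
        breaches.append("volume SLO breached")
    if slis["null_pct_order_total"] > slo["max_null_pct_order_total"]:
        breaches.append("distribution SLO breached")
    return breaches

print(evaluate("analytics.orders",
               {"freshness_lag_minutes": 95, "daily_rows": 48_200, "null_pct_order_total": 0.002}))
# ['freshness SLO breached', 'volume SLO breached']
```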

Many data teams I work with are excited at the prospect of integrating with the latest and greatest data infrastructure and business intelligence tools but, as I’ve written previously, such solutions are only as good as the data that powers them. These SLIs will enable you to better understand how good that data actually is and whether you can trust it.

Shift data reliability left and prevent issues before they occur

While you will never out-architect bad data, you can improve the reliability of data by reducing the number of data incidents.

First identify your key datasets; these are the tables that are queried most frequently and have the largest impact on downstream consumers. Then consider whether it makes sense to deploy preventive techniques such as data contracts or circuit breakers. Data contracts prevent downstream incidents caused by unexpected schema changes while circuit breakers stop the flow of data into the data warehouse if it doesn’t meet set criteria.
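
As a rough sketch, a circuit breaker can be as simple as a validation step that refuses to load a batch that fails agreed-upon checks; the checks and field names below are hypothetical, and inside a pipeline framework tools like dbt tests or Airflow’s ShortCircuitOperator express a similar idea.

```python
# Minimal sketch of a circuit breaker: validate a batch and refuse to load it
# into the warehouse if it fails the agreed-upon criteria. The checks and
# field names are hypothetical.

class CircuitBreakerTripped(Exception):
    pass

def validate_batch(rows: list) -> None:
    if not rows:
        raise CircuitBreakerTripped("empty batch: upstream extract likely failed")
    missing_ids = sum(1 for row in rows if row.get("customer_id") is None)
    if missing_ids / len(rows) > 0.05:
        raise CircuitBreakerTripped(f"{missing_ids} rows missing customer_id (> 5% of batch)")

def load(rows: list) -> None:
    validate_batch(rows)  # breaker trips here, so bad data never reaches the warehouse
    # warehouse.insert("analytics.orders", rows)  # proceeds only if checks pass
```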

These best practices can be time-intensive, and when used incorrectly they may actually introduce more data downtime. That’s why you will want to focus on your production-grade pipelines and most valuable data products.

Establish ownership

This two-word phrase wins the contest for “easier said than done.” Most data leaders understand that clear lines of ownership enable quick action and set the stage for accountability. What is routinely underestimated is the need for layers of ownership.

There needs to be ownership for data quality at the organizational level, the domain/project level, and even the pipeline/table level. These ownership lines need to be traced back to the owners of the business processes they support. And there need to be mechanisms for resolving conflicts between these spheres of ownership: if a key table is leveraged by data products in both marketing and finance, who owns it?

Our recommendation for these challenges? 

  • Create a data team RACI for data quality.
  • Create SLAs (more on this later) that establish data and business team ownership for key tables and data products. Hint: data product managers are useful here.
  • Where ownership lines overlap, create clarity by appointing a “custodian” who has primary ownership, but must consult other data team stakeholders for any changes.
RACI for data reliability

Document assets

It’s no data engineer’s favorite task, but mistakes get made when good documentation doesn’t exist. Additionally, documentation enables discovery and ultimately self-service.

The challenge is your ability to pipe data is virtually limitless, but you are constrained by the capacity of humans to make it sustainably meaningful. In other words, you can never document all of your data, and rarely will you have as much documentation as you’d like.

So how can you solve this problem? Some of our suggestions include:

  • Iterate and automate: One company decided to hold off on a column level approach for their documentation since columns are often self-describing. Instead they focused on tables, views, and schemas on their key assets for a “minimum viable product.” They used their own system as well as Monte Carlo to identify key data assets and owners so they knew who to bother about documentation.
  • Tie it to the data onboarding process: Documentation is often everyone’s responsibility and therefore becomes no one’s responsibility. Some of the organizations with best-in-class data asset documentation are those that put mechanisms in place that only allow for key datasets to be added to their warehouse/lake environment once documentation is in place.
  • Measure your documentation levels: Create a dashboard or otherwise track the percentage of your data assets that are appropriately documented. You can also use an overall data product maturity scorecard like the one below from former New York Times senior vice president of data and insights, Shane Murray. One of the key columns for measuring maturity is the documentation level.
  • Leverage and integrate with other sources: There are multiple solutions inside and outside of the modern data stack that can contain helpful context on data assets. For example, one company has integrated documentation coming in from dbt within the Catalog plane of the Monte Carlo platform. Others leverage data catalog solutions (some of which are experimenting with ChatGPT for automating documentation). Another solution is to integrate with, or train data consumers to leverage, what is typically an unused goldmine of information about a company’s data: Slack. A quick Slack search for a table name before asking a question can save tons of time.

Check out how Cribl approaches documentation and creating a data-driven culture.

Conduct post-mortems

Post-mortems can contribute greatly to data reliability as they help ensure incidents don’t happen again. A few post-mortem best practices include:

  • Frame everything as a learning experience: To be constructive, post-mortems must be blameless (or at the very least, blame-aware).
  • Use this as an opportunity to assess your readiness for future incidents: Update runbooks and make adjustments to your monitoring, alerting, and workflow management tools. 
  • Document each post-mortem and share with the broader data team: Documentation is just as important as any other step in the incident management process because it prevents knowledge gaps from accruing if engineers with tribal knowledge leave the team or aren’t available to help.

We would also recommend these post-mortem templates from GitHub.

Use data health insights to understand hot spots

Pareto’s principle (80% of the consequences come from 20% of the causes) is alive and well in the data engineering field. In this case, a small share of tables and pipelines accounts for most data quality issues.

In other words, it’s a good rule of thumb to assume 20% of your tables are creating 80% of your data quality issues. Cross-referencing those problematic hot spots with your list of key assets is a good place to concentrate your investment of data quality resources. While that can be easier said than done, tools like a data reliability dashboard can make this optimization a reality.
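
If your incident log records which table each incident touched, finding those hot spots is a simple aggregation. Here’s a minimal sketch with made-up incident records:

```python
from collections import Counter

# Hypothetical incident log: each record names the table where the incident surfaced.
incidents = [
    {"table": "analytics.orders"}, {"table": "analytics.orders"},
    {"table": "analytics.orders"}, {"table": "marketing.attribution"},
    {"table": "analytics.orders"}, {"table": "finance.revenue"},
    {"table": "analytics.orders"}, {"table": "marketing.attribution"},
]

running = 0
for table, count in Counter(i["table"] for i in incidents).most_common():
    running += count
    print(f"{table}: {count} incidents ({running / len(incidents):.0%} cumulative)")
# analytics.orders alone accounts for ~63% of incidents -- start there.
```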

Monte Carlo’s table health data reliability dashboard.

You don’t have to limit Pareto’s principle to tables either. You can slice and dice data quality metrics by domain because, rest assured, there is one department within your organization that is on fire more frequently than the rest.

Monte Carlo’s data reliability dashboard can show you which domains have the most incidents or the slowest time to response.

This focused approach to data quality optimization was the one taken by Brandon Beidle, Director of Data Product at Red Ventures. He said:

“The next layer is measuring performance. How well are the systems performing? If there are tons of issues, then maybe we aren’t building our system in an effective way. Or, it could tell us where to optimize our time and resources. Maybe 6 of our 7 warehouses are running smoothly, so let’s take a closer look at the one that isn’t. Having a record also evolved the evaluations of our data team from a feeling, ‘I feel the team is/isn’t doing well,’ to something more evidence-based.”

Data health insights can also help optimize your data operations by preventing table versioning issues or identifying degrading queries before they become problematic.

Invest in data reliability

The truth is that, in one way or another, you are already investing in data reliability: through the manual work your team does to verify data, the custom validation rules your engineers write, or simply the cost of decisions made on broken data and silent errors that went unnoticed. And it’s a hell of a price to pay.

But there is a better way. In the same way that site reliability engineers use automation to ensure application uptime and improve their efficiency, data teams should also rely on machine learning-enabled platforms to make data reliability easier and more accessible — leading to better decisions, better trust, and better outcomes.

Like any good SRE solution, the strongest data reliability platform will give you automated, scalable, ML-driven observability into your pipelines — making it easy to instrument, monitor, alert, troubleshoot, resolve, and collaborate on data issues — ultimately reducing your data downtime and thereby increasing the reliability of your data pipelines overall.

Now, with clear SLIs, SLOs, and a new approach for data reliability in tow, we can finally leave firefighting to the professionals.

If you want to learn more about the reliability of data, reach out to Barr Moses.