What is Data Reliability?

Barr Moses

CEO and Co-founder, Monte Carlo. Proponent of data reliability and action movies.

Data reliability goes beyond assessing completeness and accuracy; it necessitates considering how data quality evolves across diverse real-world conditions over time. When addressing data reliability, it is crucial to move beyond measuring data quality at a specific point in time and space. Instead, focus on establishing expected levels of quality and service. This includes determining how promptly you communicate and respond to incidents, as well as equipping yourself with the necessary toolkit to swiftly diagnose and resolve data-related issues.

As data professionals, we can learn a lot from software engineering when it comes to building robust, highly available systems. In a previous article, I discussed why data reliability is a must-have for data teams, and here, I share how we can apply this concept in practice through engineering operations.

Coined by Google SVP Benjamin Treynor Sloss in the early 2000s, Site Reliability Engineering, a subset of DevOps, refers to “what happens when you ask a software engineer to design an operations function.” In other words, site reliability engineers (SREs for short) build automated software to optimize application uptime while minimizing toil and reducing downtime. On top of these duties, SREs are known as the “firefighters” of the engineering world, working to address hidden bugs, laggy applications, and system outages.

Now, as data systems reach similar levels of complexity and higher levels of importance in an organization, we can apply these same concepts to our field as data reliability — an organization’s ability to deliver high data availability and health throughout the entire data life cycle.

Data reliability through applying SRE/DevOps principles

While firefighting is certainly a core responsibility, SREs are also charged with finding new ways to thoughtfully manage risk by understanding the opportunity cost for new features and other innovations. To drive this data-driven decision making, establish clear Service Level Objectives (SLOs) that define what this reliability looks like in the real world as measured by Service Level Indicators (SLIs).

Once SLOs and SLIs (say that 10 times fast…) are established, SREs can easily determine this balance between reliability and risk. Even with the smartest solutions and most experienced SREs on tap, achieving 100% system uptime isn’t a realistic goal. Innovation relies on iteration, and the only way to eliminate downtime entirely is to stay static, but that doesn’t give you the competitive edge. As one of my SRE friends aptly noted: “it’s not a matter of if the site will go down, it’s a matter of when.”

Much in the same way that SREs strike this balance between reliability and innovation, we must also ensure that our data pipelines are both reliable and flexible enough to allow for the introduction of new data sources, business logic, transformations, and other variables beneficial to both our companies and our customers.

In the same way we meticulously manage application downtime, we must focus on reducing data downtime — periods of time when data is inaccurate, missing, or otherwise erroneous.

There have been a number of major application downtime outages for companies as varied as GitHub, IBM, DoorDash, and Slack — and data downtime is a similarly serious threat.

Firefighting isn’t just for SREs. As data professionals, we also deal with our fair share of data downtime fires, but we don’t have to. Image courtesy of Jay Heike on Unsplash.

Not only does bad data lead to poor decision making, but monitoring for and solving data reliability issues can cost teams valuable time and money. If you’re in data, you probably know how much time is spent on firefighting data downtime. In fact, many data leaders tell us their data scientists and data engineers spend 30 percent or more of their time tackling data issues — energy better spent innovating.

Catch data reliability issues before anyone else does

Over the past several years, I’ve spoken with over 150 data leaders about the reliability of data, ranging from a few null values to wholly inaccurate data sets. Their individual issues ran the gamut, but one thing was clear: there was more at stake than a few missing data points.

One VP of Engineering at a popular high-end clothing rental company told me that before his team started monitoring for data downtime, their entire database of customer information was 8 hours off, revealing massive tech debt. Making matters worse, they didn’t catch this issue for several months, only identifying it during a data warehouse migration. While it ended up being a relatively simple fix (and an embarrassing discovery), it would have been far better to know about and resolve it ASAP.

This lack of data reliability took a toll on their business. Analysts who relied on timely data to make informed decisions for their customers lacked confidence in their pipelines. Loss of revenue ensued. These kinds of incidents happen too often — and no company is spared.

In the same way that SRE teams are the first to know about application crashes or performance issues, data teams should be the first to know about bad pipelines and data quality issues, too. Only six years ago, data loss and downtime cost companies a cumulative $1.7 trillion annually; in an age where data is ubiquitous and data management tools haven’t necessarily caught up, these numbers have likely gotten worse.

To avoid data downtime, it’s important to have full observability over your data throughout its entire lifecycle — all the way from source to consumption. Strong pipelines lead to accurate and timely insights, which allow for better decision making, true governance, and happier customers.

Data quality vs data reliability

Data quality is often expressed in the six dimensions of accuracy, completeness, consistency, timeliness, validity, and uniqueness. Those six dimensions of data quality typically measure the data and its fitness for a specific use at a specific moment in time.

Data reliability requires you to think beyond a point in time, and consider how the quality changes over time in a variety of real-world conditions. 

For example, the quality of an airline might be measured based on its timeliness (percent of flights on-time), safety (major incidents) and food service approximating that of a diner. But in order for the airline to be reliable, it’s expected to maintain those levels of quality consistently over time, across various routes, weather conditions and holiday weekends. 

Similarly, a data product’s quality might be assessed by its availability at 9am, the completeness of records, and its consistency versus a source-of-record. For it to be reliable, you must assess whether it maintains these service levels over time, across holiday traffic spikes and product launches.  

In solving for data reliability you must not simply measure data quality (at a point in time and space), but also establish expected levels of quality and service (i.e. how quickly you’ll communicate and respond to incidents), and have the toolkit to rapidly diagnose and resolve data incidents. 

Data quality vs data reliability. Image courtesy of Shane Murray.

How do I make my data reliable?

I propose two primary ways data teams can achieve high data reliability at their organization: 1) set data SLOs and 2) invest in an automated solution that reduces data downtime.

Leverage a data observability platform to set SLOs and SLIs

Setting SLOs and SLIs for system reliability is an expected and necessary function of any SRE team, and in my opinion, it’s about time we applied them to data, too. Some companies are already doing this.

In the context of data, SLOs refer to the target range of values a data team hopes to achieve across a given set of SLIs. What your SLOs look like will vary depending on the demands of your organization and the needs of your customers. For instance, a B2B cloud storage company may have an SLO of 1 hour or less of downtime per 100 hours of uptime, while a ridesharing service will aim for as much uptime as humanly possible.
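To make that concrete, here is a minimal sketch of what checking compliance against a downtime-style SLO might look like. The 99% target and the incident durations are hypothetical, not values from any particular platform.

```python
from datetime import timedelta

# Hypothetical SLO: at least 99% data uptime (roughly 1 hour of downtime per 100 hours).
SLO_UPTIME_TARGET = 0.99

def meets_slo(observation_window: timedelta, incident_durations: list[timedelta]) -> bool:
    """Return True if observed data uptime over the window meets the SLO target."""
    total_downtime = sum(incident_durations, timedelta())
    uptime_ratio = 1 - (total_downtime / observation_window)
    return uptime_ratio >= SLO_UPTIME_TARGET

# Two incidents totaling 45 minutes in a 100-hour window comfortably meets a 99% target.
print(meets_slo(timedelta(hours=100), [timedelta(minutes=30), timedelta(minutes=15)]))
```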

Here’s how to think about defining your data SLIs. In previous posts, I’ve discussed the five pillars of data observability. Reframed, these pillars are your five key data SLIs: freshness, distribution, volume, schema, and lineage.

  • Freshness: Freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated.
  • Distribution: Distribution, a function of your data’s possible values, tells you whether your data falls within an accepted range.
  • Volume: Volume refers to the completeness of your data tables and offers insights on the health of your data sources.
  • Schema: Schema changes often indicate broken data.
  • Lineage: When data breaks, lineage tells you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it.
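As a rough illustration of how a team might encode a few of these pillars as SLIs with concrete SLO thresholds, here is a minimal sketch. The thresholds and the metadata inputs are assumptions; in practice, a data observability platform would track these signals automatically rather than relying on hand-rolled checks.

```python
from datetime import datetime, timezone

# Hypothetical SLO thresholds for three of the pillars; tune these per table.
FRESHNESS_SLO_HOURS = 6          # table should be updated at least every 6 hours
VOLUME_SLO_MIN_ROWS = 10_000     # a daily load below this row count is suspicious
AMOUNT_RANGE = (0.0, 100_000.0)  # accepted range for a numeric field (distribution)

def freshness_ok(last_updated_at: datetime) -> bool:
    """Freshness SLI: how recently was the table updated?"""
    age_seconds = (datetime.now(timezone.utc) - last_updated_at).total_seconds()
    return age_seconds <= FRESHNESS_SLO_HOURS * 3600

def volume_ok(rows_loaded: int) -> bool:
    """Volume SLI: did we receive roughly as much data as expected?"""
    return rows_loaded >= VOLUME_SLO_MIN_ROWS

def distribution_ok(values: list[float]) -> bool:
    """Distribution SLI: do observed values fall within the accepted range?"""
    low, high = AMOUNT_RANGE
    return all(low <= v <= high for v in values)
```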

Many data teams I work with are excited at the prospect of integrating with the latest and greatest data infrastructure and business intelligence tools but, as I’ve written previously, such solutions are only as good as the data that powers them. These SLIs will enable you to better understand how good that data actually is and whether you can trust it.

Shift data reliability left and prevent issues before they occur

While you will never out-architect bad data, you can improve the reliability of data by reducing the number of data incidents.

First identify your key datasets; these are the tables that are queried most frequently and have the largest impact on downstream consumers. Then consider whether it makes sense to deploy preventive techniques such as data contracts or circuit breakers. Data contracts prevent downstream incidents caused by unexpected schema changes while circuit breakers stop the flow of data into the data warehouse if it doesn’t meet set criteria.

These best practices can be time intensive and when used incorrectly may actually introduce more data downtime. That’s why you will want to focus on your production grade pipelines and most valuable data products.
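For illustration, a circuit breaker can be as simple as a pre-load check that halts the pipeline when a batch violates its contract. The expected columns and row threshold below are hypothetical, and in practice this sketch would run as an orchestrator task immediately before the warehouse load.

```python
# Hypothetical contract for an incoming batch; real values come from your data contract.
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "created_at"}
MIN_EXPECTED_ROWS = 1_000

class CircuitBreakerTripped(Exception):
    """Raised to halt the pipeline before bad data reaches the warehouse."""

def circuit_breaker(batch_columns: set[str], batch_row_count: int) -> None:
    missing = EXPECTED_COLUMNS - batch_columns
    if missing:
        raise CircuitBreakerTripped(f"Schema contract violated; missing columns: {missing}")
    if batch_row_count < MIN_EXPECTED_ROWS:
        raise CircuitBreakerTripped(
            f"Volume anomaly: {batch_row_count} rows < {MIN_EXPECTED_ROWS} expected"
        )
    # If no exception is raised, the downstream load step is allowed to run.
```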

Establish ownership

This two-word phrase wins the contest for “easier said than done.” Most data leaders understand that clear lines of ownership enable quick action and set the stage for accountability. What is routinely underestimated is the need for layers of ownership.

There needs to be ownership for data quality at the organizational level, the domain/project level, and even the pipeline/table level. These ownership lines need to be traced back to the owners of the business processes they support. And there need to be mechanisms for when these spheres of ownership conflict: if a key asset table is leveraged by data products in both marketing and finance, who owns it?

Our recommendation for these challenges? 

  • Create a data team RACI for data quality.
  • Create SLAs (more on this later) that establish data and business team ownership for key tables and data products. Hint: data product managers are useful here.
  • Where ownership lines overlap, create clarity by appointing a “custodian” who has primary ownership, but must consult other data team stakeholders for any changes.
RACI for data reliability

Document assets

It’s no data engineer’s favorite task, but mistakes get made when good documentation doesn’t exist. Additionally, documentation enables discovery and ultimately self-service.

The challenge is your ability to pipe data is virtually limitless, but you are constrained by the capacity of humans to make it sustainably meaningful. In other words, you can never document all of your data, and rarely will you have as much documentation as you’d like.

So how can you solve this problem? Some of our suggestions include:

  • Iterate and automate: One company decided to hold off on a column level approach for their documentation since columns are often self-describing. Instead they focused on tables, views, and schemas on their key assets for a “minimum viable product.” They used their own system as well as Monte Carlo to identify key data assets and owners so they knew who to bother about documentation.
  • Tie it to the data onboarding process: Documentation is often everyone’s responsibility and therefore becomes no one’s responsibility. Some of the organizations with best-in-class data asset documentation are those that put mechanisms in place that only allow for key datasets to be added to their warehouse/lake environment once documentation is in place.
  • Measure your documentation levels: Create a dashboard or otherwise track the percentage of your data assets that are appropriately documented (a rough sketch of such a coverage metric follows this list). You can also use an overall data product maturity scorecard like the one below from former New York Times senior vice president of data and insights, Shane Murray. One of the key columns for measuring maturity was the documentation level.
  • Leverage and integrate with other sources: There are multiple solutions inside and outside of the modern data stack that can contain helpful context on data assets. For example, one company has integrated documentation coming in from dbt within the Catalog plane of the Monte Carlo platform. Others leverage data catalog solutions (some of which are experimenting with ChatGPT for automating documentation). Another solution is to integrate with or train data consumers to leverage what is typically an unused goldmine of information about a company’s data: Slack. A quick Slack search for a table name before asking a question can save tons of time.
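To make the documentation-coverage suggestion above concrete, here is a minimal sketch of such a metric. The asset fields (`is_key_asset`, `owner`, `description`) are assumptions about what a catalog or metadata export might contain, not any specific tool’s schema.

```python
def documentation_coverage(assets: list[dict]) -> float:
    """Percentage of key data assets that have both an owner and a description."""
    key_assets = [a for a in assets if a.get("is_key_asset")]
    if not key_assets:
        return 0.0
    documented = [a for a in key_assets if a.get("owner") and a.get("description")]
    return 100 * len(documented) / len(key_assets)

# Example with made-up catalog entries: only "orders" is fully documented -> 50%.
assets = [
    {"name": "orders", "is_key_asset": True, "owner": "data-eng", "description": "Order facts"},
    {"name": "tmp_scratch", "is_key_asset": False},
    {"name": "sessions", "is_key_asset": True, "owner": None, "description": "Web sessions"},
]
print(f"{documentation_coverage(assets):.0f}% of key assets documented")
```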

Check out how Cribl approaches documentation and creating a data-driven culture.

Conduct post-mortems

Post-mortems can contribute greatly to data reliability as they help ensure incidents don’t happen again. A few post-mortem best practices include:

  • Frame everything as a learning experience: To be constructive, post-mortems must be blameless (or if not, blame aware). 
  • Use this as an opportunity to assess your readiness for future incidents: Update runbooks and make adjustments to your monitoring, alerting, and workflow management tools. 
  • Document each post-mortem and share with the broader data team: Documentation is just as important as any other step in the incident management process because it prevents knowledge gaps from accruing if engineers with tribal knowledge leave the team or aren’t available to help.

We would also recommend these post-mortem templates from GitHub.

Use data health insights to understand hot spots

Pareto’s principle (80% of the consequences are the result of 20% of the actions) is alive and well in the data engineering field. In this case, it applies to data quality issues and the tables and pipelines that cause them.

In other words, it’s a good rule of thumb to assume 20% of your tables are creating 80% of your data quality issues. Cross-referencing those problematic hot-spots with your list of key assets is a good place to concentrate your investment of data quality resources. While that can be easier said than done, items like a data reliability dashboard can make this optimization a reality.
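As a rough sketch of that cross-referencing exercise, assuming you can export incidents with an affected table and a time to resolve (the column names below are hypothetical), a few lines of pandas will surface the hot spots:

```python
import pandas as pd

# Made-up incident export; in practice this would come from your observability tooling.
incidents = pd.DataFrame({
    "table_name": ["orders", "orders", "sessions", "payments", "orders", "sessions"],
    "minutes_to_resolve": [120, 45, 30, 300, 60, 15],
})

# Rank tables by incident count and total downtime to find the ~20% causing most of the pain.
hot_spots = (
    incidents.groupby("table_name")
    .agg(incident_count=("table_name", "size"),
         downtime_minutes=("minutes_to_resolve", "sum"))
    .sort_values(["incident_count", "downtime_minutes"], ascending=False)
)
print(hot_spots.head(10))
```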

Monte Carlo’s table health data reliability dashboard.

You don’t have to limit Pareto’s principle to tables, either. You can slice and dice data quality metrics by domain because, rest assured, there is one department within your organization that is on fire more frequently than the rest.

Monte Carlo’s data reliability dashboard can show you which domains have the most incidents or the slowest time to response.

This focus and data quality optimization was the approach taken by Brandon Beidle, Red Ventures Director of Data Product. He said:

“The next layer is measuring performance. How well are the systems performing? If there are tons of issues, then maybe we aren’t building our system in an effective way. Or, it could tell us where to optimize our time and resources. Maybe 6 of our 7 warehouses are running smoothly, so let’s take a closer look at the one that isn’t. Having a record also evolved the evaluations of our data team from a feeling, ‘I feel the team is/isn’t doing well,’ to something more evidence-based.”

Data health insights can also help optimize your data operations by preventing table versioning issues or identifying degrading queries before they become problematic.

Invest in data reliability

The truth is that, in one way or another, you are already investing in data reliability, whether it’s through the manual work your team does to verify data, the custom validation rules your engineers write, or simply the cost of decisions made on broken data and silent errors that went unnoticed. And it’s a hell of a price to pay.

But there is a better way. In the same way that site reliability engineers use automation to ensure application uptime and improve their efficiency, data teams should also rely on machine learning-enabled platforms to make data reliability easier and more accessible — leading to better decisions, better trust, and better outcomes.

Like any good SRE solution, the strongest data reliability platform will give you automated, scalable, ML-driven observability into your pipelines — making it easy to instrument, monitor, alert, troubleshoot, resolve, and collaborate on data issues — ultimately reducing your data downtime and thereby increasing the reliability of your overall data pipelines.

Now, with clear SLIs, SLOs, and a new approach for data reliability in tow, we can finally leave firefighting to the professionals.

If you want to learn more about the reliability of data, reach out to Barr Moses.