As data professionals, we can learn a lot from software engineering when it comes to building robust, highly available systems. In a previous article, I discussed why data reliability is a must-have for data teams, and here, I share how we can apply this concept in practice through engineering operations.

Coined by Google SVP Benjamin Treynor Sloss in the early 2000s, Site Reliability Engineering, a subset of DevOps, refers to “what happens when you ask a software engineer to design an operations function.” In other words, site reliability engineers (SREs for short) build automated software to optimize application uptime while minimizing toil and reducing downtime. On top of these duties, SREs are known as the “firefighters” of the engineering world, working to address hidden bugs, laggy applications, and system outages.

Now, as data systems reach similar levels of complexity and higher levels of importance in an organization, we can apply these same concepts to our field as data reliability — an organization’s ability to deliver high data availability and health throughout the entire data life cycle.

From application downtime to data downtime

While firefighting is certainly a core responsibility, SREs are also charged with finding new ways to thoughtfully manage risk by understanding the opportunity cost of new features and other innovations. To drive this data-driven decision making, they establish clear Service Level Objectives (SLOs) that define what reliability looks like in the real world, as measured by Service Level Indicators (SLIs).

Once SLOs and SLIs (say that 10 times fast…) are established, SREs can weigh this balance between reliability and risk. Even with the smartest solutions and most experienced SREs on tap, achieving 100% system uptime is all but impossible. Innovation relies on iteration, and the only way to eliminate downtime entirely is to stay static, which costs you your competitive edge. As one of my SRE friends aptly noted: “it’s not a matter of if the site will go down, it’s a matter of when.”

Much in the same way that SREs strike this balance between reliability and innovation, we must also ensure that our data pipelines are both reliable and flexible enough to allow for the introduction of new data sources, business logic, transformations, and other variables beneficial to both our companies and our customers.

In the same way we meticulously manage application downtime, we must focus on reducing data downtime — periods of time when data is inaccurate, missing, or otherwise erroneous.

There have been a number of major application downtime outages for companies as varied as GitHub, IBM, DoorDash, and Slack — and data downtime is a similarly serious threat.

Firefighting isn’t just for SREs. As data professionals, we also deal with our fair share of data downtime fires, but we don’t have to. Image courtesy of Jay Heike on Unsplash.

Not only does bad data lead to poor decision making, but monitoring for and solving data reliability issues can cost teams valuable time and money. If you’re in data, you probably know how much time is spent on firefighting data downtime. In fact, many data leaders tell us their data scientists and data engineers spend 30 percent or more of their time tackling data issues — energy better spent innovating.

Know before anyone else does

Over the past several years, I’ve spoken with over 150 data leaders about their data downtime, ranging from a few null values to wholly inaccurate data sets. Their individual issues ran the gamut, but one thing was clear: there was more at stake than a few missing data points.

One VP of Engineering at a popular high-end clothing rental company told me that before his team started monitoring for data downtime, their entire database of customer information was 8 hours off, revealing massive tech debt. Making matters worse, they didn’t catch the issue for several months, only identifying it during a data warehouse migration. While it ended up being a relatively simple fix (and an embarrassing discovery), it’s the kind of problem you want to know about and resolve right away.

Data downtime took a toll on their business. Analysts who relied on timely data to make informed decisions for their customers lacked confidence in their pipelines. Loss of revenue ensued. These kinds of incidents happen too often — and no company is spared.

In the same way that SRE teams are the first to know about application crashes or performance issues, data teams should be the first to know about bad pipelines and data quality issues, too. Only six years ago, data loss and downtime cost companies a cumulative $1.7 trillion annually; in an age where data is ubiquitous and data management tools haven’t necessarily caught up, these numbers have likely gotten worse.

To avoid data downtime, it’s important to have full observability over your data throughout its entire lifecycle — all the way from source to consumption. Strong pipelines lead to accurate and timely insights, which allow for better decision making, true governance, and happier customers.

How do I make my data reliable?

I propose two primary ways data teams can achieve high data reliability at their organization: 1) set data SLOs and 2) invest in an automated solution that reduces data downtime.

Set SLOs and SLIs for data

Setting SLOs and SLIs for system reliability is an expected and necessary function of any SRE team, and in my opinion, it’s about time we applied them to data as well. Some companies are already doing just that.

In the context of data, SLOs refer to the target range of values a data team hopes to achieve across a given set of SLIs. What your SLOs look like will vary depending on demands of your organization and the needs of your customers. For instance, a B2B cloud storage company may have an SLO of 1 hour or less of downtime per 100 hours of uptime, while a ridesharing service will aim for as much uptime as humanly possible.
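
To make this concrete, here is a minimal Python sketch of how a data team might track a downtime budget against an SLO like the one above; the incident records and the one-hour-per-100-hours target are assumptions for the example, not a recommended threshold.

```python
from datetime import timedelta

# Hypothetical data downtime incidents recorded over an observation window.
incidents = [
    {"table": "orders", "duration": timedelta(minutes=25)},
    {"table": "customers", "duration": timedelta(minutes=40)},
]

# Example SLO: no more than 1 hour of data downtime per 100 hours of operation.
observation_window = timedelta(hours=100)
downtime_budget = timedelta(hours=1)

total_downtime = sum((i["duration"] for i in incidents), timedelta())
availability = 1 - total_downtime / observation_window

print(f"Total data downtime: {total_downtime}, availability: {availability:.2%}")
if total_downtime > downtime_budget:
    print("SLO breached: the downtime budget is exhausted")
else:
    print(f"Within SLO: {downtime_budget - total_downtime} of budget remaining")
```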

Here’s how to think about defining your data SLIs (a short code sketch of what they might look like in practice follows the list). In previous posts, I’ve discussed the five pillars of data observability. Reframed, these pillars are your five key data SLIs: freshness, distribution, volume, schema, and lineage.

  • Freshness: Freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated.
  • Distribution: Distribution, a function of your data’s possible values, tells you whether your data falls within an accepted range.
  • Volume: Volume refers to the completeness of your data tables and offers insights on the health of your data sources.
  • Schema: Schema changes often indicate broken data.
  • Lineage: When data breaks, lineage tells you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it.
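
Here is a minimal sketch of what the first four SLIs might look like as checks on a warehouse table, assuming a pandas DataFrame with an `updated_at` timestamp column and an `order_amount` column; the thresholds are illustrative rather than prescribed, and lineage usually comes from your orchestrator or warehouse metadata rather than from the table itself.

```python
import pandas as pd

def check_data_slis(df: pd.DataFrame, expected_schema: dict) -> dict:
    """Compute a few illustrative data SLIs for a table snapshot."""
    # Freshness: hours since the most recent record landed.
    # Assumes updated_at is a timezone-aware UTC timestamp column.
    now = pd.Timestamp.now(tz="UTC")
    freshness_hours = (now - df["updated_at"].max()).total_seconds() / 3600

    # Volume: row count as a proxy for the completeness of the load.
    row_count = len(df)

    # Distribution: share of nulls in a business-critical column.
    null_rate = df["order_amount"].isna().mean()

    # Schema: do columns and dtypes still match what consumers expect?
    actual_schema = {col: str(dtype) for col, dtype in df.dtypes.items()}

    return {
        "freshness_ok": freshness_hours <= 6,   # SLO: refreshed within 6 hours
        "volume_ok": row_count >= 10_000,       # SLO: at least 10k rows per load
        "distribution_ok": null_rate <= 0.01,   # SLO: no more than 1% nulls
        "schema_ok": actual_schema == expected_schema,
    }
```

In practice, you’d run checks like these on every load and track the results over time, so that a missed SLO surfaces as an alert rather than as a confused stakeholder.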

Many data teams I work with are excited at the prospect of integrating with the latest and greatest data infrastructure and business intelligence tools but, as I’ve written previously, such solutions are only as good as the data that powers them. These SLIs will enable you to better understand how good that data actually is and whether you can trust it.

Invest in data reliability

The truth is, in one way or another, you are already investing in data reliability, whether it’s through the manual work your team does to verify data, the custom validation rules your engineers write, or simply the cost of decisions made on the basis of broken data and silent errors that went unnoticed. And it’s a hell of a price to pay.

But there is a better way. In the same way that site reliability engineers use automation to ensure application uptime and improve their efficiency, data teams should also rely on machine learning-enabled platforms to make data reliability easier and more accessible — leading to better decisions, better trust, and better outcomes.

Like any good SRE solution, the strongest data reliability platform will give you automated, scalable, ML-driven observability into your pipelines, making it easy to instrument, monitor, alert, troubleshoot, resolve, and collaborate on data issues — ultimately reducing data downtime and increasing the reliability of your data pipelines overall.
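
As a rough illustration of the “monitor and alert” piece, the sketch below wires the SLI checks from earlier to a notification; the webhook endpoint is hypothetical, and a real platform layers on anomaly detection, lineage-aware triage, and incident workflows rather than hard-coded thresholds.

```python
import json
import requests  # assumes the requests library is installed

ALERT_WEBHOOK = "https://hooks.example.com/data-alerts"  # hypothetical endpoint

def alert_on_sli_breach(table_name: str, slis: dict) -> None:
    """Notify the team about any SLI that has fallen out of its SLO."""
    breached = [name for name, ok in slis.items() if not ok]
    if not breached:
        return
    payload = {
        "table": table_name,
        "breached_slis": breached,
        "message": f"Possible data downtime: {table_name} failed {', '.join(breached)}",
    }
    requests.post(ALERT_WEBHOOK, data=json.dumps(payload), timeout=10)
```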

Now, with clear SLIs, SLOs, and a new approach for data reliability in tow, we can finally leave firefighting to the professionals.

If you want to learn more, reach out to Barr Moses.