In June 2020, it was reported that bad data had hampered the U.S. government’s ability to roll out its COVID-19 economic recovery programs. Among other grievous errors, this data downtime incident sent nearly $1.4 billion in COVID-19 stimulus checks to dead people.
Data downtime — periods of time when data is partial, erroneous, missing or otherwise inaccurate — isn’t just a problem for the federal government. Almost every organization struggles with it.
In part one of a two-part series, I propose a solution to data downtime: data reliability, a concept borrowed from Site Reliability Engineering (SRE) that has been adopted by some of the best data teams in the industry.
How do we solve for data downtime?
The other day, I was catching up with a VP of Data I think highly of at a popular public tech company. He told me about the impact of data downtime across his company, from financial and regulatory reporting to marketing analytics and even customer engagement metrics.
He had grown jaded about traditional data quality methods as a cure for data issues.
“Data quality checks only go so far,” he said (and yes, agreed to be quoted anonymously). “I want something that will keep me in the know about data downtime before anyone else, including my boss, knows. Seriously, let me put it this way: I see this as the ‘keep our CFO out of jail’ priority.”
He’s not alone. Over the past several years, I’ve spoken with hundreds of data leaders about their data downtime issues, ranging from a few null values to wholly inaccurate data sets. The consequences ran the gamut, from wasted time (a no-brainer) to wasted money and even significant compliance risks.
To solve for data downtime, I propose an approach that leverages some best practices of our friends, the “bad software” wranglers: site reliability engineers.
The rise of SRE
Since the early 2000s, Site Reliability Engineering (SRE) teams at Google (where the term originated) and other companies have been critical not just for fixing outages, but for preventing them in the first place by building scalable and highly available systems. As software systems became increasingly complicated, engineers developed novel, automated ways of scaling and operationalizing their tech stacks to balance their dual needs for reliability and innovation.
Site reliability engineers (SREs) are often depicted as brave firefighters, paged at all hours of the night to address hidden bugs, laggy applications, and system outages. Beyond firefighting, SRE teams build automated solutions for software deployment, configuration management, monitoring, and metrics, which eliminate toil and minimize application downtime in the first place.
Reliability for data teams
In Site Reliability Engineering, the phrase “hope is not a strategy” is a popular one. It informs the SRE ethos that systems do not run themselves, and that behind every piece of software is an engineer who can, to the best of their ability, ensure some measure of reliability.
Hope will not save your company from making decisions with the wrong numbers. Data reliability will.
In the same way that SRE teams are the first to know about application crashes or performance issues, data engineering and operations teams should be the first to know about bad pipelines and data downtime issues, too. Only six years ago, data downtime cost companies a cumulative $1.7 trillion annually; in an age where data is ubiquitous and data management tools haven’t necessarily caught up, these numbers have likely gotten worse.
To make data fully available, trusted, and self-serve, data teams must focus on reducing data downtime by achieving full reliability.
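As a rough illustration of what being "the first to know" could look like in practice, here is a minimal Python sketch that flags the kinds of data downtime described earlier: stale, missing, or partial data. Everything in it is an assumption made for illustration, not a prescribed method: the `TableSnapshot` type, the `find_downtime` function, and the default thresholds are all hypothetical, and a real team would pull these numbers from its own warehouse metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableSnapshot:
    """Hypothetical summary of one table at check time."""
    name: str
    last_updated: datetime  # when the table last received new data
    row_count: int          # rows in the latest load
    null_count: int         # nulls in one column we care about


def find_downtime(
    snapshot: TableSnapshot,
    now: datetime,
    max_staleness: timedelta = timedelta(hours=24),  # assumed threshold
    min_rows: int = 1,                               # assumed threshold
    max_null_rate: float = 0.05,                     # assumed threshold
) -> list[str]:
    """Return a list of issue labels; an empty list means no issue was found."""
    issues = []
    # Stale: the table has not been refreshed recently enough.
    if now - snapshot.last_updated > max_staleness:
        issues.append("stale")
    # Missing: the latest load has fewer rows than expected.
    if snapshot.row_count < min_rows:
        issues.append("missing")
    # Partial: too many nulls in the column we track.
    elif snapshot.null_count / snapshot.row_count > max_null_rate:
        issues.append("partial")
    return issues


if __name__ == "__main__":
    now = datetime(2020, 6, 30, 12, 0, tzinfo=timezone.utc)
    payments = TableSnapshot("payments", now - timedelta(days=3), 1000, 200)
    print(payments.name, find_downtime(payments, now))
```

Checks like these only cover issues you already know to look for. Their value comes from running automatically on a schedule and alerting the data team before a downstream consumer notices.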
There’s no doubt that this new approach is a game changer for the industry, and I’m excited to see companies joining the data reliability movement. After all: who needs hope when you can trust your data?
If you want to learn more, reach out to Barr Moses.