A Matter of Trust

We didn’t start the fire. It was the schema change.

Many years ago, an exec approached me after a contentious meeting and asked, “Shane, so is the data trustworthy?” 

Perhaps you can relate.

My response at the time probably did not build confidence: “Some of it, if not precise, is at least directionally useful.”

I’ve been pondering this question and my unsatisfying response recently as I talk to data leaders about how to communicate data reliability, whether that be to executives or to the end-users of their data products. 

For data teams, trust is everything. Unfortunately, trust is also often a lagging indicator of quality and reliability. Trust is often assumed until it’s lost, usually following a major incident.

In most cases, data quality was objectively in decline behind the scenes long before the incident occurred. Conversely, major improvements in data quality may go unnoticed, and trust is rebuilt only slowly after such an incident.

The relationship between data reliability, trust, and incidents may look something like this:

When a data exec recently told me that he roughly measures trust and quality by “days since the last major incident,” it struck a chord. Data incidents are the events that undermine trust, not just in your data but in your entire strategy, product, or team.

Just as mechanical engineers look for signs that their machines need preventive maintenance to avoid costly breakdowns, data engineers need to monitor indicators of data reliability to understand when proactive steps are needed to avoid costly data incidents.

You don’t want to be in a situation where you are repairing the pipeline after it has burst and the damage is done. And damage can be done. For example, Unity, the popular gaming software company, cited “bad data” for a $110M impact on their ads business.

But if trust is the important lagging indicator, then what is the best leading indicator?

The reliability requirements for a specific data product depend on the type of data, how it’s used, and who uses it. Some data must be highly available (low latency) while accuracy is less critical, such as the data behind content or product recommendations. Other data can be delayed without a loss of trust, but must be deadly accurate when delivered, such as financial or health data.

This is why understanding the business objective and talking to stakeholders when building your data product SLAs is so important.

For simplicity’s sake, let’s segment our data products into three classes in order to address the different expectations for reliability: 

Data downtime, calculated as the number of incidents × (the average time to detection + the average time to resolution), is a helpful metric for overall data quality.
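To make the arithmetic concrete, here is a minimal sketch of the downtime calculation. The `Incident` record, its field names, and the sample durations are made up for illustration; they are not drawn from any real incident log or tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Incident:
    # Hypothetical record: how long the issue went unnoticed,
    # and how long it took to fix once someone noticed.
    time_to_detection: timedelta
    time_to_resolution: timedelta


def data_downtime(incidents):
    """Total downtime: detection time plus resolution time, summed over incidents."""
    return sum(
        (i.time_to_detection + i.time_to_resolution for i in incidents),
        timedelta(),
    )


# Illustrative values only.
incidents = [
    Incident(timedelta(hours=4), timedelta(hours=2)),
    Incident(timedelta(hours=1), timedelta(hours=5)),
    Incident(timedelta(hours=10), timedelta(hours=2)),
]

total = data_downtime(incidents)

# The same number via "incidents x (average detection + average resolution)".
n = len(incidents)
avg_ttd = sum((i.time_to_detection for i in incidents), timedelta()) / n
avg_ttr = sum((i.time_to_resolution for i in incidents), timedelta()) / n
total_from_averages = n * (avg_ttd + avg_ttr)

print(total, total_from_averages)
```

Summing per incident and multiplying the incident count by the averages give the same total, so either form works; the per-incident sum is easier to slice by table or domain later.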

“Data uptime” SLAs drill down to another level of detail by indicating the health of our data, based on the specific reliability goals we care about (freshness, accuracy, etc.), for the specific data products that are most consequential to our business.

Then, we might set the following SLAs*

This measure is: 

  • Explainable (“data uptime, got it!”), 
  • Trendable (“data uptime increased 5% this quarter”) and 
  • Comparable with context (“dataset A with 95% uptime is more reliable than dataset B with 88% uptime,” provided both have the same SLAs).
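As a rough sketch of how an uptime percentage could be derived and checked against a target, assuming uptime is simply the share of a reporting period not spent in downtime. The function names, the 30-day window, and the 95% target are illustrative assumptions, not a prescribed standard.

```python
from datetime import timedelta


def uptime_pct(downtime: timedelta, period: timedelta) -> float:
    """Percentage of the reporting period during which the data met its SLA."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    # Clamp so that overlapping or over-reported incidents cannot
    # push the result outside the 0-100 range.
    bounded = min(max(downtime, timedelta(0)), period)
    return 100.0 * (1 - bounded / period)


def meets_sla(uptime: float, target_pct: float) -> bool:
    """True when measured uptime is at or above the (hypothetical) SLA target."""
    return uptime >= target_pct


# Illustrative numbers: 36 hours of downtime in a 30-day window.
period = timedelta(days=30)
dataset_a = uptime_pct(timedelta(hours=36), period)
dataset_b = uptime_pct(timedelta(hours=90), period)

print(round(dataset_a, 2), meets_sla(dataset_a, 95.0))
print(round(dataset_b, 2), meets_sla(dataset_b, 95.0))
```

Because both datasets are measured over the same window and against the same target, the two percentages can be compared directly and tracked quarter over quarter.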

Typically, early gains in uptime (or reductions in downtime) will come from the effectiveness of responding to incidents, reducing the time to detect and resolve. After these improvements, data teams will advance towards targeting the systematic weaknesses that cause incidents, driving further gains in uptime. 

*Some teams may decide to get even more granular, with separate metrics for availability and accuracy depending on the data product.

Focusing on what matters most

The complexity of data warehouses – many domains, thousands of tables – will invariably require a simple distillation of data uptime metrics. 

Not all data incidents are created equal. Some are more severe than others, and that severity affects how much trust is lost as a result. But incident severity is another lagging indicator, so what would be the best way to account for it within the leading indicator, data uptime?

Assigning an importance weight to each table based on its usage and criticality to the business can give you a weighted uptime % for each data domain.
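One way to sketch this is a weighted average of table-level uptime within each domain. The table names, domains, weights, and uptime figures below are invented for illustration, and the 1-to-5 weighting scale is an assumption rather than a recommended scheme.

```python
from collections import defaultdict


def weighted_uptime_by_domain(tables):
    """Return {domain: importance-weighted average uptime %} for the given tables."""
    # domain -> [sum of weight * uptime, sum of weights]
    totals = defaultdict(lambda: [0.0, 0.0])
    for t in tables:
        acc = totals[t["domain"]]
        acc[0] += t["weight"] * t["uptime_pct"]
        acc[1] += t["weight"]
    # Skip domains whose weights sum to zero to avoid dividing by zero.
    return {d: ws / w for d, (ws, w) in totals.items() if w > 0}


# Hypothetical inventory: weight reflects usage and business criticality.
tables = [
    {"name": "orders", "domain": "finance", "weight": 5, "uptime_pct": 99.0},
    {"name": "scratch_tmp", "domain": "finance", "weight": 1, "uptime_pct": 75.0},
    {"name": "clicks", "domain": "marketing", "weight": 2, "uptime_pct": 90.0},
]

weighted = weighted_uptime_by_domain(tables)
print(weighted)
```

In this made-up example, the rarely used `scratch_tmp` table pulls the finance domain's plain average down to 87%, while the weighted figure stays near the uptime of the table the business actually depends on.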

This leads us to another potential path to downtime optimization – cleaning up tables of “low importance” in the warehouse that are contributing to downtime, thereby driving up your overall uptime. 

With detailed SLAs we can understand our data reliability levels and fix issues BEFORE they turn into the data incidents that compromise trust. If nothing else, when it’s your turn to have an executive ask you, “how trustworthy is the data,” you can provide an appropriately data-driven response.

– shane
