Data Quality Metrics: You’re Measuring It Wrong.

One of our customers recently posed this question related to data quality metrics:

I would like to set up an OKR for ourselves [the data team] around data availability. I’d like to establish a single data quality KPI that would summarize availability, freshness, quality.

What’s the best way to do this?

I can’t tell you how much joy this request brought me. As someone obsessed with data quality KPIs (yeah, you read that right: instead of counting sheep, I dream about null values and data freshness these days), this was a dream come true.

Why do data quality metrics matter?

If you’re in data, you’re either currently working on a data quality project or you just wrapped one up. It’s the law of bad data — there’s always more of it.

Traditional methods of measuring data quality metrics are often time- and resource-intensive, spanning several variables, from accuracy (a no-brainer) and completeness, to validity and timeliness (in data, there’s no such thing as being fashionably late). But the good news is there’s a better way to approach data quality metrics.

Data downtime — periods of time when your data is partial, erroneous, missing, or otherwise inaccurate — is an important data quality metric for any company striving to be data-driven.

It might sound cliché, but it’s true — we work hard to collect, track, and use data, but so often we have no idea if the data is actually accurate. In fact, companies frequently end up having excellent data pipelines, but terrible data. So what’s all this hard work to set up a fancy data architecture worth if at the end of the day, we can’t actually use the data?

Measuring data downtime gives you a simple data quality KPI that helps you determine the reliability of your data, giving you the confidence necessary to use it or lose it.

So you want a data quality KPI for it?

Overall, data downtime is a function of the following data quality metrics:

  • Number of data incidents (N) — This factor is not always in your control, given that you rely on data sources “external” to your team, but it’s certainly a driver of data downtime.
  • Time-to-detection (TTD) — In the event of an incident, how quickly are you alerted? In extreme cases, this quantity can be measured in months if you don’t have the proper methods for detection in place. Silent errors made by bad data can result in costly decisions, with repercussions for both your company and your customers.
  • Time-to-resolution (TTR) — Once an incident is known, how quickly can you resolve it?

Under this definition, a data incident refers to a case where a data product (e.g., a Looker report) is “incorrect,” which could be a result of a number of root causes, including:

  • All/parts of the data are not sufficiently up-to-date
  • All/parts of the data are missing/duplicated
  • Certain fields are missing/incorrect

Here are some examples of things that are not a data incident:

  • A planned schema change that does not “break” any downstream data
  • A table that stops updating as a result of an intentional change to the data system (deprecation)
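To make the distinction concrete, here’s a minimal sketch of the kind of checks that would flag the incident types above. The table name, row format, required fields, and staleness threshold are all hypothetical; real detection would run against your warehouse, not in-memory rows.

```python
from datetime import datetime, timedelta

def find_incidents(table_name, rows, last_updated, now=None, *,
                   max_staleness=timedelta(hours=24),
                   required_fields=("id", "amount")):
    """Flag the three incident types above: stale data, missing data,
    and missing/null required fields. Thresholds are assumptions."""
    now = now or datetime.utcnow()
    incidents = []
    if now - last_updated > max_staleness:
        incidents.append(f"{table_name}: data not sufficiently up-to-date")
    if not rows:
        incidents.append(f"{table_name}: all data missing")
    for field in required_fields:
        if any(row.get(field) is None for row in rows):
            incidents.append(f"{table_name}: field '{field}' missing or null")
    return incidents
```

Note that an intentional deprecation or a planned, non-breaking schema change would never trip these checks, which is exactly the point: only unexpected breakage counts toward downtime.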

Bringing this all together, I’d propose the right formula for data downtime is:

Data downtime = N × (TTD + TTR)

That is, data downtime is the number of data incidents multiplied by the sum of the average time-to-detection and the average time-to-resolution. It’s an effective data quality metric and a very simple data quality KPI.
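As a minimal sketch, the formula can be computed from a log of incidents. The `Incident` record and its field names are assumptions for illustration; in practice you’d derive TTD and TTR from incident timestamps in your ticketing or observability system.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    # Hours from when the incident began until it was detected / resolved.
    hours_to_detection: float
    hours_to_resolution: float

def data_downtime(incidents: list[Incident]) -> float:
    """Data downtime = N * (avg TTD + avg TTR), in hours."""
    n = len(incidents)
    if n == 0:
        return 0.0
    avg_ttd = sum(i.hours_to_detection for i in incidents) / n
    avg_ttr = sum(i.hours_to_resolution for i in incidents) / n
    return n * (avg_ttd + avg_ttr)

incidents = [Incident(4.0, 2.0), Incident(10.0, 6.0)]
print(data_downtime(incidents))  # 2 * (7 + 4) = 22.0 hours
```

Because N multiplies the whole sum, cutting average detection or resolution time pays off across every incident, which is why TTD and TTR are the levers to pull first.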

If you want to take this data quality KPI a step further, you could also categorize incidents by severity and weight downtime accordingly, but for simplicity’s sake, we’ll save that for a later post.

With the right combination of automation, advanced detection, and seamless resolution, you can minimize data downtime by reducing TTD and TTR. There are even ways to reduce N, which we’ll discuss in future posts (spoiler: it’s about getting the right visibility to prevent data incidents in the first place).

Measuring data downtime is the first step in understanding data quality, and from there, ensuring data reliability. With fancy algorithms and data quality KPIs flying all over the place, it’s easy to overcomplicate how to measure data quality.

While there are benefits to following the 6 dimensions of data quality (timeliness, accuracy, completeness, consistency, uniqueness, and validity), sometimes the simplest way is the best way.

If you want to learn more, reach out to Barr Moses. Or book a time to speak with us below.