Data Observability, Data Reliability

How ELT Schedules Can Improve Root Cause Analysis For Data Engineers

Ryan Kearns

Ryan is a data scientist at Monte Carlo.

“Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’” – Randall Munroe

In this article, Ryan Kearns, co-author of O’Reilly’s Data Quality Fundamentals and a data scientist at Monte Carlo, discusses the limitations of segmentation analysis when it comes to root cause analysis for data teams, and proposes a better approach: ELT schedules as Bayesian Networks. 

Data engineering and root cause analysis

“Root Cause Analysis” (RCA), a procedure for defining the fundamental explanation of an issue, is a bit of a sanctified term in the software world.

Ongoing issues don’t really end until the root cause is established. Whether you’re dealing with a broken API call, a website that’s not loading, or a metric that’s gone awry, identifying the root cause is always fundamental in the resolution process. At that point, engineers can explain the source of the failure, work towards shipping a fix, and mitigate the issue in the future.

In the new field of data observability, RCA is equally essential. The fastest way to make bad data good is to understand why it went bad in the first place. After all, once we detect the data is bad, the first thing we want to know is why.

Why are 10% of the rows suddenly filled with NULL values in your all-important marketing spend table? Why is the customer_uuid field suddenly showing tons of duplicates when it is supposed to be unique?

However, we data professionals are starting to use “RCA” to refer to tools that assist in RCA workflows, not tools that do any actual RCA themselves. This may sound trivial but the difference is significant. For example, many data quality RCA approaches use segmentation analysis: identifying responsible segments of the data that give better context to the incident.

Segmentation analysis to expose that a certain customer ID is highly overrepresented in a duplication incident. Helpful? Yes. But it should be used in conjunction with other anomaly resolution analyses.  
Segmentation analysis to expose that a certain customer ID is highly overrepresented in a duplication incident. Helpful? Yes. But it should be used in conjunction with other anomaly resolution analyses.  Image courtesy of Monte Carlo.

Segmentation analysis is an excellent correlative feature to give greater context to a data incident. Often, segmentation analysis may point you towards the root cause of an issue. 

This can occur when:

  1. The on-call engineer has lots of context about the table in question.
  2. The segmentation analysis is sparse – that is, it returns only select correlations that a human could reasonably parse.
  3. The segmentation analysis is meaningful – that is, the correlations returned make actual sense and tell a story.

In the above case, upstream changes in an ELT job on several customers’ data sources caused an important metric for our machine learning models to drift significantly from the baseline. Segmentation analysis allowed us to attribute the drift to that specific upstream change. 

Notice that there’s nothing in the segmentation analysis itself that identifies the root cause – an engineer with knowledge of the change was on-scene to connect the dots. Specifically here, note that one RESOURCE_ID figures in 4% of all records, but in 38% of the records with anomalous importance_scores. Turns out, this problem occurred with just three customers with the same edge-case integration setup, so we could quickly roll back the recent change to those environments.

Where correlation breaks down

But let’s think about a slightly different problem. What if, instead of just an ELT change for these three customers, we had a failure affecting the whole ELT pipeline? 

If a table fails to update, one bad SQL LEFT JOIN could result in dozens of downstream columns turning NULL all at the same time. If everything in the downstream table has caught fire, segmentation analysis won’t do anything to help us. 

We might see hundreds of notifications like “Field X is 100% correlated with field Y!” – and maybe one of these correlations hints at the actual root cause. More likely than not, you’ll be dealing with countless comorbidities of the same upstream symptom.

There’s a deeper point about what’s missing here, though. Segmentation analysis assumes that just looking at one table constitutes sufficient RCA. In any modern cloud data infrastructure, though, no table ever stands alone. 

Tables rarely contain the root causes of their own data downtime. ELT pipelines time out, schemas are unwittingly changed, API functionality deprecates. We have seen this reality in our own data about data observability. We find that our “High Correlation Insights” feature only applies to ~4.4% of the anomalies we catch.

Data quality issues happen in context, and we rely on data engineers to know that context.

With approaches like segmentation analysis, we shift very little of that burden to the RCA tools themselves, even when we could be doing so, saving ourselves time and headaches in the process.

Here’s why data engineering teams need to go beyond segmentation analysis when it comes to reducing time to resolution for data downtime.

Case study: in-context RCA for data downtime

It’s time we take RCA for data downtime seriously, which means taking the structure of data downtime seriously. 

Modern data pipelines are complicated, interdependent systems with upstream and downstream dependencies. This includes software, data, and schedule dependencies, in other words, lineage.

When data goes wrong, something upstream in the pipeline – software, data, or schedule – has to be wrong, too. When dealing with vexing, real-time issues with data, we often forget that data environments are highly structured. When it comes to conducting RCA on broken data systems, we can – and should – use this to our advantage.

To make my point concrete, let’s look at one specific mainstay in the modern data pipeline: ELT jobs. ELT platforms, like dbt, run data transformations using scheduled DAGsDirected Acyclic Graphs.

An example DAG, showing how dbt models upstream and downstream dependencies in a single job. Source: dbt Docs.

DAGs are not arbitrary mathematical structures. They exhibit plenty of neat properties (transitive closure, topological ordering, etc.) that dramatically reduce the complexity of many graph algorithm problems (like finding shortest paths). 

They’re also the structure of Bayesian networks, powerful computational devices that we can use for (real) data downtime RCA. In fact, the structure of Bayesian networks modeling ELT schedules is simpler than a generic Bayesian network! Let’s walk through this via an example.

An example DAG

Let’s model a simple data downtime problem to show how. Here’s a simplified DAG showcasing part of an ELT schedule that runs at Monte Carlo:

This job combines some feature aggregation steps with a model training step, followed by some export steps. It’s a part of a larger machine learning pipeline. Note that _rs, _sf, and _bq refer to preprocessing metrics collected from Redshift, Snowflake, and BigQuery environments respectively.

Here’s the corresponding Gantt chart from one run:

A Gantt chart showing the runtimes of all jobs in the DAG, starting at the moment the DAG is kicked off.

Notice anything structurally interesting?

  • features_agg can’t run until all of the preprocessing_ steps are complete, because those preprocessing steps are upstream dependencies in the DAG. Even though preprocessing_sf finished early on this run, progress in the DAG is bottlenecked until preprocessing_rs finishes.
  • features_snapshot takes longer to run than metrics_train, but it doesn’t bottleneck the start of thresholds_loader, because it’s not an upstream dependency.

Calculating the DAG’s runtime

Oftentimes, data engineers want to predict the runtime of the entire DAG, whether to set an SLA or be assured that metrics will update with a certain cadence. If we want to know how long the whole DAG takes to run, we can’t just sum up all the individual job runtimes. We can’t take the maximum runtime either. Instead, we have to resort to features of the DAG. Which jobs are upstream from which other ones? Which jobs are the bottlenecks?

In this case, the runtime will always be given as this function of job runtimes:

time(DAG) = MAX(time(preprocessing_rs),
     time(preprocessing_bq)) +
 time(features_agg) +
     time(metrics_train) + time(thresholds_loader))

Maybe this equation is daunting, but the rules for creating it are simple:

  • If processes happen in parallel, take the MAX() of their runtimes
  • If processes happen in sequence (or in “series”), take the sum of their runtimes

With these two rules together, you can take any ELT DAG and build an equation that calculates its total runtime.

Using schedule dependencies to solve data freshness anomalies

All well and good when the DAG finishes on time. Of course, the interesting case is when it doesn’t. 

Many teams define SLOs to guarantee that certain processes complete by certain times. This particular DAG in our system runs every 6 hours. Sometimes, it fails or stalls, and in some cases data (from thresholds_loader) won’t arrive for 7 or 8 hours. In these cases we’d like to understand the root cause – why is our DAG not performing as expected? And what do we need to change to keep the updates coming on schedule?

Concretely, a Gantt chart for an SLO-breaching DAG run could look like this:

A Gantt chart showing a DAG run that took longer than 6 hours to complete.

Or it could look like this:

A Gantt chart showing a different DAG run taking longer than 6 hours. What’s different?

How do we say, mathematically, which jobs fault it was that the DAG took too long?

Enter Bayesian networks. Once again, we’ll turn to the ELT DAG. We already gave the equation for time(DAG) above, which tells us the total runtime as a function of job runtimes. If we know the runtimes from a particular DAG run we’re interested in, this is great. But what about in general? How long should we expect the DAG to take, probabilistically?

To think probabilistically we should turn to Bayesian networks, which means we need to make a new DAG… out of our DAG. (Don’t panic! Bayesian networks are just types of DAGs too – special ones intended to answer probabilistic queries using statistics and causal inference.)

To turn our ELT DAG into a Bayesian network we need to think of each job runtime no longer as a discrete value (like, “43 seconds”), but instead as a variable drawn from a distribution of runtimes.

Computational processes – especially ones that deal with data – often take a sort of consistent time to complete, but there will be variance. Maybe your load balancer is having a bad day, or maybe someone’s just barraged your API with requests. Maybe one of Redshift, Snowflake, or BigQuery is having independent performance issues.

So let’s start by making all the runtimes probabilistic. Now, each node elt_job represents a random variable, where P(elt_job < t) represents the probability that that job finishes in less than t seconds.

Ok… so this is our Bayesian network? Doesn’t it look exactly the same as the ELT DAG we had earlier?

Well, this is a legitimate Bayesian network, but it’s not a proper representation of our DAG at all. Why? We’ve actually thrown out some super valuable information – namely, the two rules about runtimes we discovered earlier. Here they are again:

  • If processes happen in parallel, take the MAX() of their runtimes
  • If processes happen in sequence (or in “series”), take the sum of their runtimes

Can we account for these rules in our Bayesian network? Spoiler alert: we can, and it decreases the complexity of our network dramatically. To do so we need to invent two “special” kinds of nodes with simple probability distributions. First, for parallel processes, we’ll have a “MAX” node, with a probability distribution given by:

For sequential processes, we’ll also have a “+” node, with a probability distribution given by:

Is this overly academically pedantic? Yes. Are we stretching the definition of a “random” variable? Maybe. But it’s really the lack of randomness here that’s important to take away, since it makes the resulting Bayesian network so much simpler computationally.

Here’s a transformed Bayesian network using those two special nodes we just defined:

The final node, DAG, is a random variable whose value depends entirely on the probabilistic runtimes of all the jobs (e.g. the variables features_agg, metrics_train, etc.) together with all the summing and maxing. More precisely, we know that:

P(DAG) = MAX(P(preprocessing_rs),
  P(preprocessing_bq)) +
   P(features_agg) +
       P(metrics_train) + P(thresholds_loader))

This equation looks a lot like the equation for the discrete runtime case, except that the discrete runtimes (time) have been replaced with probabilities for each runtime (P). The summing and maxing remained the same, though, which makes this equation pretty tractable.

Of course, the distributions like P(preprocessing_rs) aren’t going to magically fall out of the sky at this point. You’ll have to estimate these from a finite number of runs of the DAG itself. There’s a whole rabbit hole to fall into here, too, but for now let’s assume you can fit normal distributions to the runtime of each job in the ELT DAG.

A fitted normal distribution to a finite collection of observed runtimes (the red data points).

So now let’s get concrete. We can see that a lot of questions we’d like to ask about the DAG can be formulated as queries against our new Bayesian network.

Q: The DAG just took 7 hours to run. Is that abnormal?

A: What’s the value of P(DAG < 7 hours)? If that number is “high” – let’s say it’s 99% – then yeah! Abnormal! A 7 hour DAG runtime is a 1% likely occurrence.

Q: Did the features_snapshot job take too long to run?

A: Say the features_snapshot job ran in time t. What’s P(features_snapshot < t)? If this number is “high” – let’s say it’s 80% – then yes! t is a pretty abnormal runtime.

Q: (THE RCA QUESTION): Was features_snapshot responsible for the DAG taking 7 hours to run? To what extent?

A: What’s P(DAG < 7 | features_snapshot = t)? If this number is high, then it’s quite likely that the DAG can finish in under 7 hours even with features_snapshot taking t seconds. If it’s low, then it seems more likely that the DAG was bottlenecked by this one job. It’s even more likely if P(DAG < 7 hours) and P(features_snapshot >= t) are low. It’s especially likely if

P(DAG < 7 | features_snapshot = E[features_snapshot])

is high, which says that the DAG would have likely finished in fewer than 7 hours if features_snapshot had taken the expected amount of time.

As we can see, the causal tapestry to explore here is rich, even without venturing into causal inference for field-level anomalies, such as an unexpected uptick in NULL-rate or a sharp decrease in some key metric. What’s more, Bayesian networks and causal inference allow us to pose questions about actual RCA with math, rather than surfacing correlative hints that a human engineer has to stitch together into a narrative.

In my opinion, domains like these are the real forefront of RCA for data quality issues: leaning into the unique structure of data infrastructure, leveraging lineage, and treating software, data, and scheduling dependencies as the first-class citizens of causal inference for data.

Data teams have the potential to automate and scale much of the root cause analysis for data downtime, and segmentation analysis on a per-table basis is not going to get us there.

Until then, here’s wishing you no data downtime! 

What do you think? Reach out to Ryan Kearns to share your thoughts. And if data quality through faster, more streamlined incident resolution and lineage is top of mind for your organization this year, schedule a demo with our team.