Why Testing Your Data Is Insufficient
In 2021, data testing alone isn’t sufficient for ensuring accurate and reliable data. Just as software engineering teams leverage solutions like New Relic, DataDog, and AppDynamics to monitor the health of their applications, modern data teams require a similar approach to monitoring and observability. Here’s how you can leverage both testing and monitoring to prevent broken data pipelines and achieve highly reliable data.
For most companies, data is the new software.
Like software, data is fundamental to the success of your business. It needs to be “always-on”, with data downtime treated as diligently as application downtime (five nines, anyone?). And just like your software, adhering to your team’s data SLAs is critical for maintaining your company’s competitive advantage.
While it makes sense that many teams would approach testing their data with the same tried-and-true methods they apply to testing the accuracy and reliability of their software, our industry is at a tipping point: data testing alone is insufficient.
Relying on data testing to find issues in your data pipeline before you run analysis is equivalent to trusting unit and integration testing to identify buggy code before you deploy new software, but it’s insufficient in modern data environments. In the same way that you can’t have truly reliable software without application monitoring and observability across your entire codebase, you can’t achieve full data reliability without data monitoring and observability across your entire data infrastructure.
Rather than relying exclusively on testing, the best data teams are leveraging a dual approach, blending data testing with constant monitoring and observability across the entire pipeline. Let’s take a closer look at what this means, and how you can start to apply data monitoring to your own stack.
What is data testing?
Data testing is the process of validating your assumptions about your data at different stages of the pipeline. Basic data testing methods include schema tests or custom data tests using fixed data, which can help ensure that ETLs run smoothly, confirm that your code is working correctly in a small set of well-known scenarios, and prevent regressions when code changes.
Data testing helps by conducting static tests for null values, uniqueness, referential integrity, and other common indicators of data problems. These tools allow you to set manual thresholds and encode your knowledge of basic assumptions about your data that should hold in every run of your pipelines.
In fact, data testing is a great solution for specific, well-known problems and will warn you when new data or new code breaks your original assumptions. You can even use testing to determine whether or not your data meets your criteria for validity — such as staying within an expected range or having unique values. This is very similar in spirit to the way software engineers use testing to alert on well understood issues that they anticipate might happen.
But, much in the same way that unit tests alone are insufficient for software reliability, data testing by itself cannot prevent broken data pipelines.
Here are 3 reasons why a hybrid approach that marries testing and monitoring is needed to pave the way forward for the modern data stack.
Data changes, a lot
In software engineering, we heavily use testing to find anticipated issues in our code. However, every software engineer knows this is insufficient if she is looking to deliver a highly reliable application. Production environments tend to have much more variability than any engineer could hope to anticipate during development.
Whether it is an edge case in business logic, hard-to-predict interactions between software components, or unanticipated input to the system, software issues will inevitably occur. Therefore, a robust strategy to reliability will combine testing as a sanity check, with monitoring and observability to validate correctness and performance in the actual production environment.
Data is no different. While testing can detect and prevent many issues, it is unlikely that a data engineer will be able to anticipate all eventualities during development, and even if she could, it would require an extraordinary amount of time and energy.
In some ways, data is even harder to test than traditional software. The variability and sheer complexity of even a moderately sized dataset is huge. To make things more complicated, data also oftentimes comes from an “external” source that is bound to change without notice. Some data teams will struggle to even find a representative dataset that can be easily used for development and testing purposes given scale and compliance limitations.
Monitoring and observability fill these gaps by providing an additional layer of visibility into these inevitable — and potentially problematic — changes to your pipelines.
End-to-end coverage is critical
For many data teams, creating a robust, high coverage test suite is extremely laborious and may not be possible or desirable in many instances — especially if several uncovered pipelines already exist. While data testing can work for smaller pipelines, it does not scale well across the modern data stack.
Most modern data environments are incredibly complex, with data flowing from dozens of sources into a data warehouse or lake and then being propagated into BI/ML for end-user consumption or to other operational databases for serving. Along the way from source to consumption, data will go through a good number of transformations, sometimes into the hundreds.
The reality is that data can break at any stage of its life cycle — whether as a result of a change or issue at the source, an adjustment to one of the steps in your pipeline, or a complex interaction between multiple pipelines. To guarantee high data reliability, we must therefore have end-to-end visibility into breakages across the pipeline. At the very least, we must have sufficient observability to be able to troubleshoot and debug issues as data propagates through the system.
Data testing becomes very limited with that in mind for a several reasons, including:
- Your pipelines may leverage several ETL engines and code frameworks along the way, making it very challenging to align on a consistent testing strategy across your organization.
- Strong coupling between transformations and testing introduces unreliability into the system — any intended change to ETL (or, in some cases, unintended failure) will lead to tests not running and issues missed.
- The complexity and sheer number of pipeline stages can make it quite onerous to reach good testing coverage.
And this just scratches the surface of data testing’s limitations when it comes to ensuring full data reliability.
Data testing debt
While we all aspire to have great testing coverage in place, data teams will find that parts of their pipelines are not covered. For many data teams, no coverage will exist at all, as reliability oftentimes takes the backseat to speed in the early days of pipeline development.
At this point, going back and adding testing coverage for existing pipelines may be a huge investment. If key knowledge about existing pipelines lies with a few select (and often very early) members of your data team, retroactively addressing your testing debt will, at the very best, divert resources and energy that could have otherwise been spent on projects that move the needle for your team. At the very worst, fixing testing debt will be nearly impossible if many of those early members of your team are no longer with the company and documentation isn’t up to date.
A solid monitoring and observability approach can help mitigate some of the challenges that come with data testing debt. By using an ML-based approach that learns from past data and monitors new incoming data, teams are able to create visibility into existing pipelines with little to no investment and folklore knowledge, as well as reduce the burden on data engineers and analysts to mitigate testing debt as it accrues.
The next step: data monitoring and observability
In 2021, data engineers are at a critical juncture — keep pace with the demands of our growing, ever-evolving data needs or settle for unreliable data. For most, there isn’t a choice.
Just like software, data requires both testing and monitoring to ensure consistent reliability. Modern data teams must think about data as a dynamic, ever-changing entity, and applying a different approach that focuses not just on rigorous testing, but also continual monitoring and observability.
By approaching data reliability with the same diligence as software reliability, data teams can ensure the health of their data at all times and across several key pillars of data health, including volume, schema, freshness, lineage, and distribution — before they affect your business.