To keep pace with data’s clock speed of innovation, data engineers need to invest not only in the latest modeling and analytics tools, but also in technologies that can increase data accuracy and prevent broken pipelines. The solution? Data observability, the next frontier of data engineering and a pillar of the emerging Data Reliability category.
As companies become increasingly data-driven, the technologies underlying these rich insights have grown more and more nuanced and complex. While our ability to collect, store, aggregate, and visualize this data has largely kept up with the needs of modern data teams (think: domain-oriented data meshes, cloud warehouses, data visualization tools, and data modeling solutions), the mechanics behind data quality and integrity have lagged.
No matter how advanced your analytics dashboard is or how heavily you invest in the cloud, your best-laid plans are all for naught if the data your stack ingests, transforms, and pushes downstream isn’t reliable. In other words, “garbage in” is “garbage out.”
Before we address what Data Reliability looks like, let’s address how unreliable, “garbage” data is created in the first place.
How good data turns bad
After speaking with several hundred data engineering teams over the past 12 months, I’ve noticed there are three primary reasons why good data turns bad: 1) a growing number of data sources in a single data ecosystem, 2) the increasing complexity of data pipelines, and 3) bigger, more specialized data teams.
More and more data sources
Nowadays, companies use anywhere from dozens to hundreds of internal and external data sources to produce analytics and ML models. Any one of these sources can change in unexpected ways and without notice, compromising the data the company uses to make decisions.
For example, an engineering team might make a change to the company’s website, thereby modifying the output of a data set that is key to marketing analytics. As a result, key marketing metrics may be wrong, leading the company to make poor decisions about ad campaigns, sales targets, and other important, revenue-driving projects.
Increasingly complex data pipelines
Data pipelines are increasingly complex with multiple stages of processing and non-trivial dependencies between various data assets. With little visibility into these dependencies, any change made to one data set can have unintended consequences impacting the correctness of dependent data assets.
Something as simple as a change of units in one system can seriously impact the correctness of another system, as in the case of the Mars Climate Orbiter. A NASA space probe, the Mars Climate Orbiter was lost in 1999 after one piece of software produced output in non-SI units (pound-force seconds) while another expected SI units (newton-seconds), bringing the spacecraft too close to the planet. Like spacecraft, analytic pipelines can be extremely vulnerable to the most innocent changes at any stage of the process.
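One way to guard against this kind of silent unit mismatch is to carry the unit alongside the value and convert explicitly at the boundary between systems. The sketch below is only an illustration of that idea: the `Impulse` type, the unit labels, and the conversion table are hypothetical names, and nothing here describes how the actual spacecraft software worked.

```python
from dataclasses import dataclass

# Conversion factors to the SI unit of impulse (newton-seconds).
# 1 pound-force is defined as exactly 4.4482216152605 newtons.
TO_NEWTON_SECONDS = {
    "N*s": 1.0,
    "lbf*s": 4.4482216152605,
}


@dataclass(frozen=True)
class Impulse:
    """A value that cannot be separated from its unit (hypothetical type)."""
    value: float
    unit: str

    def to_si(self) -> float:
        # Fail loudly on an unknown unit instead of assuming SI.
        if self.unit not in TO_NEWTON_SECONDS:
            raise ValueError(f"unknown unit: {self.unit!r}")
        return self.value * TO_NEWTON_SECONDS[self.unit]


# The upstream system reports in pound-force seconds; the downstream
# consumer converts explicitly rather than trusting a bare number.
upstream_reading = Impulse(10.0, "lbf*s")
print(upstream_reading.to_si())
```

The same pattern applies to analytics pipelines: a column named `amount` says nothing about currency or scale, while a column paired with an explicit unit gives downstream consumers something to validate against.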
Bigger, more specialized data teams
As companies increasingly rely on data to drive smart decision making, they are hiring more and more data analysts, scientists, and engineers to build and maintain the data pipelines, analytics, and ML models that power their services and products, as well as their business operations.
Miscommunication or insufficient coordination is inevitable, and will cause these complex systems to break as changes are made. For example, a new field added to a data table by one team may cause another team’s pipeline to fail, resulting in missing or partial data. Downstream, this bad data can lead to millions of dollars in lost revenue, erosion of customer trust, and even compliance risk.
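The "new field added by another team" scenario can be caught before it breaks anything by comparing the schema a pipeline expects with the schema it actually observes. The following is a minimal sketch under stated assumptions: the column names, the `EXPECTED_SCHEMA` mapping, and both helper functions are hypothetical, and type names are inferred from a single sample row purely for illustration.

```python
# Schema the downstream pipeline was written against (hypothetical).
EXPECTED_SCHEMA = {"order_id": "int", "amount": "float", "created_at": "str"}


def infer_schema(row):
    """Infer a column -> type-name mapping from one sample row."""
    return {column: type(value).__name__ for column, value in row.items()}


def diff_schema(expected, observed):
    """Report columns that were added, removed, or changed type."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }


# Another team has started writing a new `coupon_code` field.
sample_row = {
    "order_id": 1001,
    "amount": 49.99,
    "created_at": "2024-01-02T06:00:00Z",
    "coupon_code": "WELCOME10",
}

changes = diff_schema(EXPECTED_SCHEMA, infer_schema(sample_row))
print(changes)
```

Surfacing the difference as data, rather than letting the pipeline fail on it, gives the owning team a chance to decide whether the change is harmless or needs coordination.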
The good news about bad data? Data engineering is going through its own renaissance, and we owe a big thank you to our counterparts in DevOps for some of the key concepts and principles guiding us towards this next frontier.
The next frontier: data observability
An easy way to frame the effect of “garbage data” is through the lens of software application reliability. For the past decade or so, software engineers have leveraged targeted solutions like New Relic and Datadog to ensure high application uptime (in other words, working, performant software) while keeping downtime (outages and laggy software) to a minimum.
In data, we call this phenomenon Data Downtime. Data Downtime refers to periods of time when data is partial, erroneous, missing, or otherwise inaccurate, and it only multiplies as data systems become increasingly complex, supporting an endless ecosystem of sources and consumers.
By applying the same principles of software application observability and reliability to data, these issues can be identified, resolved and even prevented, giving data teams confidence in their data to deliver valuable insights.
Below, we walk through the five pillars of data observability. Each pillar encapsulates a series of questions which, in aggregate, provide a holistic view of data health. Maybe they’ll look familiar to you?
- Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
- Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
- Volume: has all the data arrived?
- Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
- Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision making?
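Several of the questions in the list above translate directly into small checks. As a rough sketch, here is what freshness, volume, and a simple completeness check (one aspect of distribution) might look like over metadata for a single table. The function names, the hand-set thresholds, and the sample values are all assumptions made for illustration; a real deployment would read these from warehouse metadata rather than from in-memory values.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at, now, max_age):
    """Freshness: was the table updated recently enough?"""
    return now - last_loaded_at <= max_age


def has_expected_volume(row_count, min_rows):
    """Volume: did at least the expected number of rows arrive?"""
    return row_count >= min_rows


def null_rate(values):
    """Completeness: share of missing values in a column (0.0 if empty)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


# Hypothetical snapshot of one table's health metadata.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_loaded_at = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
emails = ["a@example.com", None, "b@example.com", "c@example.com"]

report = {
    "fresh": is_fresh(last_loaded_at, now, max_age=timedelta(hours=24)),
    "volume_ok": has_expected_volume(row_count=len(emails), min_rows=3),
    "null_rate_ok": null_rate(emails) <= 0.5,
}
print(report)
```

Each check answers one narrow question; the value of the pillars comes from looking at the answers together for the same asset.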
A robust and holistic approach to data observability requires the consistent and reliable monitoring of these five pillars through a centralized interface that serves as a single source of truth about the health of your data.
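The lineage pillar in particular lends itself to a small illustration. If dependencies between assets are recorded as a graph, "what is impacted downstream of this asset?" becomes a traversal. The sketch below assumes a hypothetical `LINEAGE` mapping with made-up asset names and uses a plain breadth-first search; it is not a description of any particular product's lineage engine.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to its direct consumers.
LINEAGE = {
    "raw.web_events": ["staging.sessions"],
    "staging.sessions": ["marts.marketing_funnel", "marts.attribution"],
    "marts.marketing_funnel": ["dashboard.campaign_roi"],
    "marts.attribution": [],
    "dashboard.campaign_roi": [],
}


def downstream_assets(lineage, start):
    """Return every asset reachable from `start`, nearest consumers first."""
    seen = {start}  # guards against cycles and repeated visits
    queue = deque([start])
    impacted = []
    while queue:
        asset = queue.popleft()
        for consumer in lineage.get(asset, []):
            if consumer not in seen:
                seen.add(consumer)
                impacted.append(consumer)
                queue.append(consumer)
    return impacted


# A change to the website's event output affects everything below it.
print(downstream_assets(LINEAGE, "raw.web_events"))
```

Running the same traversal on a reversed graph would answer the upstream half of the question: which sources feed a given asset.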
An effective, proactive data observability solution will connect to your existing stack quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies. Additionally, it will automatically monitor your data-at-rest without requiring the extraction of data from your data store. This approach ensures that you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.
Such a solution also requires minimal configuration and practically no threshold-setting. It uses ML models to automatically learn your environment and your data. It uses anomaly detection techniques to let you know when things break. And it minimizes false positives by taking into account not just individual metrics, but a holistic view of your data and the potential impact from any particular issue.
This approach provides rich context that enables rapid triage and troubleshooting, and effective communication with stakeholders impacted by Data Reliability issues. Unlike ad hoc queries or simple SQL wrappers, such monitoring doesn’t stop at “field X in table Y has values lower than Z today.”
Perhaps most importantly, such a solution prevents Data Downtime incidents from happening in the first place by exposing rich information about data assets across these five pillars so that changes and modifications can be made responsibly and proactively.
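The idea of learning what "normal" looks like from history, rather than hand-setting a threshold per table, can be illustrated with a deliberately simple statistical baseline. The sketch below flags a daily row count that sits far from the historical mean, measured in standard deviations. It is a stand-in for the ML-based anomaly detection described above, not an implementation of it: the `is_anomalous` function, the `z_threshold` default, and the sample counts are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it deviates strongly from the values in `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly constant history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily row counts for one table over the past week.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(is_anomalous(daily_row_counts, latest=1012))  # ordinary fluctuation
print(is_anomalous(daily_row_counts, latest=400))   # most rows missing
```

A single metric checked in isolation like this is exactly the kind of signal that produces false positives, which is why combining it with context such as lineage and schema changes matters.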
What’s next for data observability?
Personally, I couldn’t be more excited for this new frontier of data engineering. As data leaders increasingly invest in Data Reliability solutions that leverage data observability, I anticipate that this field will continue to intersect with some of the other major trends in data engineering, including: data mesh, machine learning, cloud data architectures, and the platformatization of data products.
Interested in pioneering the field of data observability with Monte Carlo? Apply for a role on our team!