Using Data Observability For Third-Party Data Validation
Data-driven companies depend heavily on third-party data, and therefore need third-party data validation. In the past couple of weeks, I’ve chatted with several:
- A direct-to-consumer brand that relies on ad campaign data from Facebook and Snapchat;
- An ecommerce company that relies on shipping data from UPS and FedEx;
- A FinTech organization that needs fresh market data each night;
- An InsurTech company that needs updated policy data from insurance providers.
The list goes on.
Ingesting this data in a reliable and scalable way isn’t easy. Some teams receive data from dozens, or even hundreds, of external sources each day.
As the requests mount, the number of pipelines does as well. As a result, third-party data validation evolves from a technical challenge to a management and visibility challenge. Each pipeline comes with its own unique expectations, such as how often data is delivered.
For example, one media company was the first to notice that the data feed from its online video partner (a tech giant serving millions of users) was sending files four hours later than usual, which caused the company’s ETL job to update incorrectly.
When there’s a problem with data coming from a third party, downstream consumers are impacted, and they don’t care who the bad data came from. Once it goes through your pipes, you own it.
Issues with this data can cause your data consumers to lose confidence in it, stop using it, or make decisions based on incomplete, stale, or false data.
For example, a gaming company noticed drift in its new user acquisition data. The social media platform it was advertising on had changed its delivery schedule, sending data every 12 hours instead of every 24.
The company’s ETLs were set to pick up data only once per day, so suddenly half of the campaign data being sent wasn’t getting processed or passed downstream, skewing the new user metrics away from “paid” and towards “organic.” Although the issue was caught in this instance, it could just as easily have driven a significant departure in the company’s acquisition strategy.
On top of that, the initial incident discovery can take anywhere from hours to weeks as the triaging teams trace transformation queries through intermediate tables before ultimately discovering the upstream pipeline failure.
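A schedule change like the one in the gaming example can be spotted at the source by comparing the gaps between file arrivals with the interval you expect. Here is a minimal sketch in Python, assuming you can collect arrival timestamps (from file metadata or load logs, for instance). The function name `detect_cadence_change` and the 25% tolerance are hypothetical choices for illustration, not part of any particular tool.

```python
from datetime import datetime, timedelta
from statistics import median


def detect_cadence_change(arrivals, expected_interval, tolerance=0.25):
    """Return True if the median gap between arrivals differs from the
    expected interval by more than `tolerance` (a fraction of the interval)."""
    if len(arrivals) < 2:
        return False  # not enough history to measure a gap
    ordered = sorted(arrivals)
    # Gap in seconds between each pair of consecutive arrivals.
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    expected = expected_interval.total_seconds()
    return abs(median(gaps) - expected) / expected > tolerance


start = datetime(2024, 1, 1)
daily = [start + timedelta(hours=24 * i) for i in range(5)]
twice_daily = [start + timedelta(hours=12 * i) for i in range(5)]

print(detect_cadence_change(daily, timedelta(hours=24)))        # False
print(detect_cadence_change(twice_daily, timedelta(hours=24)))  # True
```

Using the median gap rather than the latest gap keeps a single late file from triggering the check, while a sustained shift from 24 hours to 12 still stands out.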
The third-party data validation problem
To catch that pipeline failure earlier, teams typically ask these questions, in order:
- How do I know the data has arrived on time?
- If the data has arrived on time, how do I know we’ve received the expected volume of data?
- If both of those are correct, how do I know the data is also correct and of high quality?
Each of these questions is then repeated for every single critical pipeline ingesting data from external sources. Not to mention, these questions also arise for internal pipelines like those ingesting data from sales CRMs or core business/product databases.
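To make the ordering concrete, the three questions can be written as a single check that stops at the first failure. This is only a sketch under stated assumptions: the `Delivery` shape, the `validate_delivery` function, and the sample column names are all hypothetical, and the quality step here covers just one simple rule (required fields must not be null).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Delivery:
    """One batch received from an external source (hypothetical shape)."""
    arrived_at: datetime
    rows: list  # each row is a dict of column name -> value


def validate_delivery(delivery, deadline, min_rows, required_fields):
    """Run the checks in order and return (passed, reason) at the first failure."""
    # 1. Freshness: did the data arrive on time?
    if delivery.arrived_at > deadline:
        return False, "late"
    # 2. Volume: did we receive at least the expected number of rows?
    if len(delivery.rows) < min_rows:
        return False, "low volume"
    # 3. Quality: are the required fields populated in every row?
    for row in delivery.rows:
        for field in required_fields:
            if row.get(field) is None:
                return False, f"null in {field}"
    return True, "ok"


deadline = datetime(2024, 1, 2, 6, 0)
fields = ["campaign_id", "spend"]
good = Delivery(datetime(2024, 1, 2, 5, 30), [{"campaign_id": 1, "spend": 9.5}] * 3)
late = Delivery(datetime(2024, 1, 2, 10, 0), good.rows)

print(validate_delivery(good, deadline, 3, fields))  # (True, 'ok')
print(validate_delivery(late, deadline, 3, fields))  # (False, 'late')
```

Stopping at the first failure mirrors the way the questions are asked: there is little point in checking quality on a batch that never arrived or arrived half empty.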
The solution: Ingestion Validation Monitoring
Monte Carlo’s latest type of monitor, Ingestion Validation, addresses exactly those third-party data validation concerns. With a few clicks, you can add customized monitoring to:
- Augment Monte Carlo’s automated freshness and volume monitoring with a set of specific rules about the expected arrival time of data, and the size and growth of the table. This ensures the right amount of data has landed before it is picked up by other jobs or APIs.
- Add field-level data quality checks, which use the table’s recent data to create baseline thresholds for a variety of metrics such as the percentage of unique or null values. These checks help measure the quality of data running through the pipeline so, for example, you’ll be alerted if a column suddenly sees a spike in NULLs.
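To illustrate the general idea of a baseline threshold built from recent data, here is a small Python sketch for the null-rate case. It is not Monte Carlo’s implementation. It assumes a simple rule (mean plus `k` standard deviations of recent daily null rates), and the function names are hypothetical.

```python
from statistics import mean, stdev


def null_rate(values):
    """Fraction of values that are None (0.0 for an empty column)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def null_rate_threshold(history, k=3.0):
    """Upper bound for an acceptable null rate, derived from recent rates.
    Assumed rule: mean + k * sample standard deviation."""
    if len(history) < 2:
        return max(history, default=0.0)  # too little history for a stdev
    return mean(history) + k * stdev(history)


def is_null_spike(todays_values, history, k=3.0):
    """True if today's null rate exceeds the baseline threshold."""
    return null_rate(todays_values) > null_rate_threshold(history, k)


# Null rates observed for one column over the last five loads.
history = [0.01, 0.02, 0.015, 0.01, 0.02]
normal_day = [None] * 2 + [1] * 98   # 2% nulls
bad_day = [None] * 30 + [1] * 70     # 30% nulls

print(is_null_spike(normal_day, history))  # False
print(is_null_spike(bad_day, history))     # True
```

The same pattern extends to other field-level metrics, such as the percentage of unique values, by swapping out `null_rate` for a different measurement.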
By leveraging Ingestion Validation Monitoring, your organization can feel confident making decisions knowing that you’re using the most up-to-date information available, while saving engineering time from writing validation tests and triaging incidents across a multitude of pipelines.
Interested in how your organization can more efficiently conduct third-party data validation? Schedule a time to talk to us and see more of our data observability platform below.