With more than 30 years of experience managing data at organizations like Ticketmaster and Eventbrite, Ed Presz considers himself a veteran of the data space. But when he left Eventbrite to lead the data team at Pie Insurance, that meant learning a new industry with unique challenges.
From day one, Ed knew data quality was going to be key. As an insurance provider that focuses on workers’ comp and commercial auto for small businesses, Pie employees needed access to accurate, reliable information to meet its customers’ needs and stay in compliance. That’s where data observability came in.
To ensure reliability beyond the data engineering team and across the organization, Ed and his team developed a self-serve incident resolution workflow that empowered business users to directly address issues, without involving data engineering at every step of the process.
At our recent IMPACT 2023, Ed shared the process, culture, and technologies behind his team’s enviable self-serve workflows. Read on for his key takeaways and the mistakes he learned from.
Table of Contents
Better data quality at Pie Insurance
When Ed joined Pie Insurance in October 2021, improving data trust was top-of-mind. “As the leader of the data engineering organization, if the business doesn’t trust the data, then I might as well just stay home,” Ed says. “They need to be able to make decisions off of that data. So the goal of my team is to make sure that data trust is at an all-time high.”
For Ed and his team, that meant focusing on three primary challenges: improving data quality, decreasing data downtime, and eliminating alert fatigue.
Data observability improved time to detection
To improve data quality, the team measures and tracks key metrics like mean time-to-detection (MTTD), mean time-to-acknowledge (MTTA), and mean time-to-resolution (MTTR). Decreasing those timespans minimizes the impact of data downtime — moments when data is erroneous, incomplete, or otherwise inaccurate.
“The consequences of data downtime can be very severe,” Ed says. “There are compliance risks. There’s lost revenue. And there’s the big one: erosion of data trust.”
To improve all three metrics and speed up the time it takes to identify and resolve data downtime, Ed and his team rely on Monte Carlo’s data observability platform. Monte Carlo provides end-to-end coverage for data quality across their data stack, including key integrations with Snowflake, OpsGenie, Fivetran, Looker, and Slack.
With automated monitoring and alerting, Monte Carlo surfaces incidents like missing data or unexpected schema changes. The team can rely on automated monitors for data quality rules generated by machine learning models, or create custom monitors for specific data needs.
This allowed the data team to be the first to know when data incidents occur — greatly improving their mean time-to-detection (MTTD). But to truly speed up the time-to-resolution, Ed wanted to get business stakeholders more directly involved in data quality efforts.
“Improving data downtime is not just a data engineering initiative,” he says.
By surfacing alerts of data quality issues to the stakeholders who are closest to the data and able to address them, Ed could drastically improve the time-to-resolution. So that’s exactly what he did.
But better detection led to more alerts. And a single Slack channel wasn’t cutting it.
Ed and his team used the Monte Carlo integration with Slack to send alerts to a dedicated channel for data quality — proactively identifying data issues and pinging the right stakeholders to help resolve them.
However, that’s when Pie’s third data quality challenge started to emerge: alert fatigue. The channel was full of notifications about all kinds of issues, and key incidents were getting missed due to sheer noise.
“We were alerting on SQL rules, schema changes, field metrics, JSON format changes, volume rules, freshness rules, dimension tracking,” says Ed. “What we ended up with was a Slack channel that was pretty noisy. Quite frankly, when we targeted certain alerts at specific business folks, they were intimidated by that channel. Alerts were being lost.”
How Monte Carlo and multiple Slack channels enabled self-serve workflows to eliminate alert fatigue
To address alert fatigue, Ed’s team reconfigured their alerting system to route notifications across five different Slack channels. They separated business stakeholders into a dedicated channel (#mc-pii-insurance-business) and added distinct channels for data engineering, schema changes, and QA and testing.
The new approach worked. Now, Ed and his team have a well-oiled machine of monitoring, alerting, and resolution. Their self-serve incident resolution framework uses features like tags and audiences to target the right users and groups in the right Slack channels with detailed notifications that include all the information they need to fix issues quickly.
“End users don’t need to run queries to recreate the data that we found, or they don’t need to log into Monte Carlo to check out the alert,” Ed says. “We can identify the problematic data and surface it right in the Slack channel so key business stakeholders can see that, aha, this piece of data is inaccurate and we need to fix it.”
Empowering business users by removing data engineers from common incident resolution workflows
One common example: a SQL rule monitor might detect that a company’s Federal Employee Identification number (FEIN) isn’t valid, usually within their Salesforce instance.
“We know that folks in agencies across Pie are seeding some value that they know isn’t right, but they’re going to go back and fix it,” says Ed. “FEIN is a value that meets certain patterns, so we know when it’s invalid — and if we see this, we want to surface it to the right person in the agency to fix it. Because things are going to fall apart from a data perspective if we don’t have this number right.”
While the old system involved pinging data engineers with any data incident and triggered multiple handoffs, Ed’s team was able to set up an automated workflow. The new approach includes all the information about the missing FEIN directly within the Slack notification, so the right business user is tagged and can tell at-a-glance what needs to be fixed.
Then, the incident status tracker makes the current status of every data incident clear and visible.
“It eliminates a number of links in the communication channel and empowers business stakeholders to fix issues and improve data quality,” says Ed. “And data engineering is no longer in the mix.”
Results: Faster incident resolution, fewer data engineering resources, and no surprises
Today, Pie’s time-to-resolution is greatly reduced and their data engineers have more time to spend on strategic work.
“Data engineering set up the monitors, but we’re no longer involved in any of the resolution steps,” says Ed. “It frees up important data engineering resources to be doing other things. A number of these data issues get resolved quickly, and that’s all done by the key stakeholders.”
This also means business stakeholders aren’t noticing the data quality issues themselves — they’re being notified by the automated workflows created by Ed and his team. That transparency improves data trust.
“At the end of the day, when there is data downtime, I don’t want to be informed by our key stakeholders,” says Ed. “I don’t want to hear about it. I want us to be able to find it first.”
Key takeaways for incident resolution with data observability
Ultimately, Ed believes that data leaders need to make data trust a top priority — and an organizational initiative.
“You really need to empower the business to decrease data downtime and improve data trust,” says Ed. “That’s a business-wide undertaking. If you’re leaning just on data engineering, that’s going to fail.”
Data observability makes it possible for data teams to play a proactive role in identifying and surfacing data quality incidents — while going directly to the stakeholder source for resolution. These workflows can take trial-and-error to set up, but when done right, it can save your data team’s valuable time, resources, and reputation.
Curious to see how your organization could benefit from self-serve incident resolution? Get in touch with our data observability experts to learn more.
Our promise: we will show you the product.