Data Mesh – Fad or Fab?
Love it or hate it – it’s here. So, we turned to one of the experts to ask him some hard questions about the hottest topic in data.
Over the past year, data mesh has taken the data community by storm. Some people love it, others don’t quite understand it, and many are already migrating towards one even if they don’t yet know it. Regardless of where you fall, it’s hard to ignore the buzz.
For those unfamiliar, data mesh is a cultural and technical concept that distributes data ownership across product domain teams, with a centralized data infrastructure and decentralized data products. As analytics becomes increasingly decentralized to meet the needs of modern, insights-driven organizations, this paradigm is becoming more and more common. While I can’t speak for the concept’s creator, Zhamak Dehghani, I find data mesh to be particularly compelling because it ambitiously paints a picture of what data democratization and self-serve tooling can achieve when data engineering teams aren’t the bottleneck for analytics, as they often are.
Since Zhamak’s article first came out and after I wrote my first piece on the topic, customers have asked me about how they should go about implementing a data mesh architecture. Clearly, something was in the water.
Over the past several months and after hundreds of conversations, I have discovered four things about data mesh:
- Like any organizational shift, data mesh requires top-down buy-in to succeed.
- Data mesh is made up of multiple technologies, layered on top of each other as part of a modern data platform – you can’t make a data mesh with just compute, storage, and a BI tool.
- For most data leaders, the outcome (and not the journey) is what’s most defined. Data mesh as a concept is still nascent and there’s no right or wrong way of “building” one as long as the fundamental principles of data mesh are intact. And the outcome? Data democratization at scale.
- Data mesh reflects a larger movement towards the decentralization of data analytics – and I couldn’t be more excited.
Perhaps no one has been as enthusiastic about data mesh as Mammad Zadeh, the former VP of Engineering at Intuit for their Data Platform team. Mammad got his start in data infrastructure when he joined Yahoo back in 2000 as a senior architect and developer. While at Yahoo, he built a distributed cloud store and a complete end-to-end distributed monitoring framework. In 2009, he joined Netflix to build their new identity and membership platforms in the cloud. In 2012, he joined LinkedIn as their Head of Distributed Data Systems, where he helped shape the vision for all real-time data infrastructure needs. During his time at Intuit, Mammad was responsible for leading the team behind the company’s data platform, where they modernized Intuit’s data infrastructure and began implementing their data mesh strategy.
But what is data mesh and why should we care about it?
I recently sat down with Mammad to chat about data mesh. Here are five reasons why Mammad is excited about it – and he thinks you should be, too.
Mammad Zadeh on Data Mesh
Data Mesh decentralizes responsibility – and that’s a good thing
As someone who has worked with data for many years and wrestled with ways to scale data solutions, I have come to see that there is only so much you can do when all of the responsibility around data sits with a central data team. We often like to isolate new and complex problems to a central team when specialized skills and expertise are needed. This tends to work well within vertical domains, and in the early days, data analytics was kind of like that. But today, data analytics, machine learning, and access to good data are at the heart of everything we do and growing at an exponential rate. We need a new way of organizing ourselves, and a new way of thinking about the architecture and infrastructure needed to satisfy the needs of the producers and consumers of data within an organization. And that means restructuring the central data team monolith and distributing the responsibility for data across engineering domains.
In some ways, I think, we can see a parallel with the mobile technology journey. A decade ago, it was common to see a central ‘mobile team’ where all mobile app development would take place. Today, however, mobile development is an integral part of every product development team. This ‘shifting’ of responsibility to the domain dev teams and away from a specialized central team can also be seen in operations and quality engineering.
This is a good thing because it helps us scale better and, more importantly, it puts accountability in the right place: where the data is produced. The domain teams are the experts on their own data, and it is on them to make sure good-quality data is available to the consumers of that data, oftentimes analysts and data scientists. We, in the central data teams, should make sure the right self-serve infrastructure and tooling is available to both producers and consumers of data so that they can do their jobs easily. Equip them with the right tools, let them interact directly, and get out of the way.
This is all great, but we’re not quite there yet. So far, engineering leaders have been reluctant to change their traditional organizational structure around data, data engineering, and data science, partly because this is still a relatively new concept and success stories are not abundant. The other main issue is that we are still nowhere near the level of self-serve, easy-to-use infrastructure-as-a-service that we need in order to push this responsibility to the domain teams.
Data mesh is trying to address these problems head-on and that gives me a lot of hope. The main principles of data mesh are domain ownership, data as a product, self-serve data platform, and federated governance, all of which are absolutely essential to help us scale data analytics and machine learning for the next decade.
Data Mesh makes data a first-class citizen
The result of decentralization is that data becomes a first-class citizen for the dev teams. Data mesh introduces the notion of a ‘data product’ and like other functional capabilities of a product, it is to be owned and operated by the domain engineering teams. The nodes on the mesh are these decentralized data products.
It’s important to remember that most organizations are still in the very early stages of figuring out their data mesh strategy and their data products. As we rethink our data analytics approach, we inevitably need to revisit some of our old habits that get in the way. For example, collecting analytics data from the guts of transactional databases was a shortcut that introduced a lot of complexity. And as I said before, the tooling isn’t quite there either, which means developing and operating these data products still requires specialized skills. Over time, two things will change: 1) new engineers entering the workforce will be more familiar with data techniques, and 2) the data infrastructure itself will get much better, easier to use, and completely self-serve. At that point, owning and operating a data product will be similar to owning and operating a functional microservice.
And just to be clear, when I talk about giving the responsibility of data to domain teams, I am not talking about assigning a single point of contact where one person in a domain knows everything around their data. I am talking about extreme ownership of data with full accountability to make sure their data is relevant, consumable, accessible, and trustable. In other words, treating data as a product. If this does not happen, the mesh cannot take off.
Data Mesh puts the onus on vendors to improve the experience
Lots of folks are asking whether data mesh is just suitable for larger organizations where the complexity and scale of data is a big problem. I think following the principles of data mesh is good architecture and is going to be the right thing to do, whether you are a small startup or a large company with tons of legacy data in a monolithic data lake.
Having said that, once you get deeper into the design and implementation of the mesh, you’ll notice that the current state of tooling largely assumes a central data monolith and is not very well integrated with standard development practices, making it difficult for the existing product developers to build and operate their data products. This is an opportunity for the data platform providers to rethink their products around this new experience. And engineering leaders could play a big role in pushing vendors to prioritize this.
I’m also hoping that we’ll see more open standards at the infrastructure level that encourage vendors to move away from their walled gardens and build tools that can interoperate with components from other vendors. I think having this kind of interoperable marketplace for solutions is critical for our industry.
In my opinion, this best-of-breed approach will allow engineering leaders to design and compose their infrastructure in a way that works best for them, as opposed to being locked in with one vendor. Over time, the infrastructure will become smarter, allowing generalist developers to build and operate their data products autonomously and with ease. Who knows, maybe at some point we won’t make a distinction between functional microservices and data microservices (i.e., data products)!
Data Mesh prioritizes data governance for the entire organization
As one of its main principles, data mesh calls out the need for a federated governance design. This allows each domain to implement the appropriate solutions for org-wide data policies to ensure availability, quality, trustworthiness, and compliance of data.
What you and your team at Monte Carlo are doing is addressing a key component of this framework since data observability plays a foundational role in ensuring data quality across the entire mesh. I think observability needs to do a few things extremely well:
- It should be able to detect any sort of anomaly that occurs in the mesh
- It should be intelligent, automated, resilient, and very reliable
- And it should have real-time open APIs to programmatically interact with it
As simple as this might sound, these are hard problems to solve. One thing I can’t stress enough is the need for open real-time interfaces. As a practitioner, I want to be able to consume the events generated by the observability layer and build automation, self-resiliency, and notifications as soon as an anomaly is detected. Having strong open APIs allows engineers to stitch things together in ways that have not been done before and solve problems as they arise, rather than waiting for the vendor to fill the gap.
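To make the idea concrete, here is a minimal Python sketch of the kind of event-driven automation Mammad describes: consuming anomaly events from an observability layer and routing them to handlers. The `AnomalyEvent` schema, the event kinds, and the handler logic are all illustrative assumptions, not any particular vendor’s API.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical anomaly event shape -- real observability platforms
# define their own schemas; this one is illustrative only.
@dataclass
class AnomalyEvent:
    dataset: str
    kind: str        # e.g. "freshness", "volume", "schema"
    severity: str    # e.g. "warn", "critical"

def route_event(event: AnomalyEvent,
                handlers: Dict[str, Callable[[AnomalyEvent], str]]) -> str:
    """Dispatch an anomaly event to the handler registered for its kind."""
    handler = handlers.get(event.kind, lambda e: f"unhandled: {e.kind}")
    return handler(event)

# Example automation: pause downstream jobs on a critical freshness
# anomaly, otherwise just notify the owning domain team.
handlers = {
    "freshness": lambda e: (f"pause downstream jobs for {e.dataset}"
                            if e.severity == "critical"
                            else f"notify owners of {e.dataset}"),
}

# In practice this payload would arrive via a webhook or event stream.
raw = '{"dataset": "orders", "kind": "freshness", "severity": "critical"}'
event = AnomalyEvent(**json.loads(raw))
print(route_event(event, handlers))  # pause downstream jobs for orders
```

The point is the shape, not the specifics: with an open, real-time event feed, each domain team can register its own automated responses instead of waiting on a vendor-built workflow.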
Data Mesh brings data closer to product engineering
Irrespective of data mesh, data and machine learning will become part of what everybody does; it will become ingrained into the DNA of product engineering teams. This shift is well underway, and is very exciting – but can also be unsettling at the same time.
On the one hand, over the next decade, we are going to see exponential technological advancements fueled by data. On the other hand, we do not yet truly know how consumers will use the data. This is why we need to make sure we have the right paradigm in place that can give us both scale and the flexibility to change. Data mesh is trying to address that.
Data mesh brings data more in line with product development as a whole, rather than being a specialized and siloed activity. Back in the day, you had the “one and only” database and it was owned by the application engineers. Analysts would have to beg and borrow time, usually after hours, to run their queries because it was not a priority and it was considered more as a back-office activity. As the importance of data grew, we split ourselves into transactional and analytical realms, created centralized data teams, and delegated all of that complexity and responsibility to them. And while it worked for a while, we are at a point now where that model is not scaling any further. That’s why I’m excited about the prospects of data mesh.
Curious about the data mesh?
Register for Zhamak Dehghani’s talk at IMPACT: The Data Observability Summit this November to learn more about the data mesh, too!