What’s In Store for the Future of the Modern Data Stack?
Bob Muglia, former CEO of Snowflake, discusses what’s next for the tooling and technologies powering data analytics and engineering.
A few weeks ago, I had the opportunity to chat with Bob Muglia, former CEO of Snowflake and one of the pioneers of the modern data stack, to learn about his predictions for the future of our industry.
If Maxime Beauchemin is the father of data engineering, Muglia is certainly the father of the analytical cloud database, bringing to market one of the most popular solutions in the space: Snowflake, which launched in 2014 and popularized one of the most transformative technologies of the decade.
Before we begin, however, it’s important to understand what exactly we mean by modern data stack:
- It’s cloud-based
- It’s modular and customizable
- It’s best-of-breed first (choosing the best tool for a specific job, versus an all-in-one solution)
- It’s metadata-driven
- It runs on SQL (at least for now)
With these basic concepts in mind, let’s dive into Bob’s predictions for the future of the modern data stack. (For an even more wide-ranging conversation, be sure to check out my interview below.)
1. Data lakes and data warehouses will become indistinguishable
Data warehouses have been around for decades and took a leap forward in 2013 when Amazon Redshift first introduced cloud-based warehousing. In recent years, more customizable and flexible data lakes have become increasingly popular, and companies have had to evaluate whether a data warehouse or a data lake is the right choice for their business. That delineation won’t last for long, according to Bob.
“More and more, data lakes and data warehouses are coming together into one coherent thing,” said Bob. “I really think they’ll be very indistinguishable in five years. It’s really whether you’re looking at it as a file, or whether you’re looking at it as a relational table. That’s the right abstraction to think of. There are times when files are valuable, particularly when it comes to interchange, but most of the operations that you want to perform are actually performed in a relational architecture. And so this idea of a data lake and data warehouse are coming together.”
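To make the "file versus relational table" abstraction concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not how any vendor's platform works: the sample data, the `sales` table, and both function names are made up for this example. The same rows are aggregated once by parsing them as a file and once by querying them as a table.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a file sitting in a data lake.
CSV_TEXT = "region,amount\nwest,10\neast,5\nwest,7\n"


def total_by_region_from_file(text):
    """File view: parse the rows and aggregate them by hand."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])
    return totals


def total_by_region_from_table(text):
    """Table view: load the same rows into a relational table and use SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    rows = [(r["region"], int(r["amount"])) for r in csv.DictReader(io.StringIO(text))]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # Each result row is a (region, total) pair, so dict() can consume the cursor.
    result = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
    conn.close()
    return result
```

Both functions return the same totals for the same data. The difference is where the work lives: in hand-written parsing code for the file view, or in a declarative query for the table view.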
2. Analytics will merge with SQL-based systems within data platforms
This convergence extends to analytics, including machine learning.
“There are basically five vendors in the industry today that are building a cloud platform that people are building on top of,” Bob said. “There’s Snowflake and Databricks, and then the three major cloud vendors—Amazon, Microsoft, and Google—all have their things. There’s coherence across all of this, and I think you’ll see analytic systems merging into the data platforms. You certainly see that with what Databricks is doing, and what Snowflake is doing, and all of the cloud vendors. You’ll see a very complete stack that will have both analytics and advanced analytics and machine learning systems, together with SQL-based data management systems.”
3. Universal standards for governance, lineage, and metrics will begin to emerge
In the immediate future, Bob hopes to see the industry begin to develop standards around data governance—starting in 2022.
“I see an opportunity for some key standards to be developed across the modern data stack related to governance, lineage, metrics, things like that,” said Bob. “I feel like there’s a need to allow for interoperability between these platforms and between tools, and there’s a need for some standards to exist.”
He doesn’t expect it to be easy, however. “Governance is not a simple problem, but it’s an important one because it is one that the world cares about—protecting people’s information is something that matters to everybody,” he said. “It matters to companies in terms of their reputation. It matters in terms of intellectual property rights. There are strong regulatory reasons. So this world is evolving and people have to stay on top of it. And while these tools of the modern data stack open up a lot of incredible capabilities of working with data, they also must be protected and appropriately managed to ensure that only the people who should have access to data are getting that access. And I think that while there are some tools that are available, we’re still very early in this.”
Bob has one specific recommendation for businesses looking to address governance.
“Gartner has done a pretty good job in laying out what they call the data fabric. That’s a model worth looking at when it comes to data governance. It’s high-level and abstract, but as someone who works with vendors, it’s a very good template to actually think through how to build some of these things.”
4. Predictive analytics will evolve dramatically
Looking a few years ahead, Bob predicts significant changes in how predictive analytics is done.
“I think we’re going to continue to see the evolution of what’s happening with predictive analytics,” Bob said. “The current generation of predictive analytic systems are really built around a data frame and you use languages like Python or Scala to operate against that data frame. And while this is effective—people are doing it, and the tools are improving—I still think we’re at a very, very primitive level. And I expect to see some fairly dramatic improvements in the way machine learning is done in the next five to ten years.”
5. Knowledge graphs will be in high demand
Bob also foresees rising demand for knowledge graphs, and he believes data platforms will evolve to meet it.
“The general trend, I think, is we’re going to start to see knowledge graphs emerging,” Bob said. “And the modern data stack will begin to evolve to enable knowledge graphs to be built through that. And that really takes the business logic associated with something and embeds it inside the database. That’s the distinction.”
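For readers new to the term, a knowledge graph can be pictured as a set of subject-predicate-object facts plus rules that live alongside them. The sketch below is a toy in-memory version in plain Python, not a description of any product Bob mentions. Every entity, relation, and function name in it is hypothetical.

```python
# A tiny in-memory triple store; all entity and relation names are hypothetical.
TRIPLES = {
    ("acme", "is_a", "customer"),
    ("acme", "placed", "order_1"),
    ("order_1", "contains", "widget"),
    ("widget", "is_a", "product"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def products_bought_by(triples, customer):
    """A business rule kept next to the data: customer -> order -> item."""
    orders = [o for _, _, o in match(triples, subject=customer, predicate="placed")]
    return sorted(
        item
        for order in orders
        for _, _, item in match(triples, subject=order, predicate="contains")
    )
```

Calling `products_bought_by(TRIPLES, "acme")` walks two relationships to answer a business question. That is the kind of logic Bob describes as moving into the database rather than staying in application code.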
6. The next generation of data sharing will require domain-oriented governance within (and between) organizations
Bob thinks that data sharing—both within organizations and between organizations, as commerce—is central to the future of our industry.
“Data starts with an organization and is created by an organization,” said Bob. “It is an asset that an organization creates, that you can then extract value from and utilize appropriately. And the work that Thoughtworks has done around the data mesh and the idea of organizational principles and domain-orientation of data is just correct.”
How that comes to fruition will look different for different organizations, Bob acknowledges.
“Now, if you’re a 50-person company, you probably don’t have a bunch of different data domains,” he said. “But if you’re a very large company, the domain-oriented data is conceptually correct. The thing that’s interesting is, what are the mechanisms you use to actually do that? That’s what we did with Snowflake. The idea of data sharing was to build the mechanisms required to enable a domain-oriented governance model. And so this idea of domain-oriented governance can very much apply in the modern data stack. That’s what data sharing is. If you look at what Snowflake has done with their data exchange within a company, it allows a company to set up different domains of data expertise, and then share that data with other organizations.”
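One way to picture domain-oriented governance is as a rule that each domain owns its datasets and explicitly grants read access to others. The sketch below is a simplified model under that assumption. It does not represent Snowflake's data sharing implementation, and the `Domain` class and its methods are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A team that owns datasets and decides which other domains may read them."""
    name: str
    datasets: dict = field(default_factory=dict)  # dataset name -> rows
    grants: dict = field(default_factory=dict)    # dataset name -> consumer domain names

    def publish(self, dataset, rows):
        """Register a dataset under this domain with no outside readers yet."""
        self.datasets[dataset] = rows
        self.grants.setdefault(dataset, set())

    def share(self, dataset, consumer):
        """Grant another domain read access to a dataset this domain owns."""
        if dataset not in self.datasets:
            raise KeyError(f"{self.name} does not own {dataset}")
        self.grants[dataset].add(consumer)

    def read(self, dataset, consumer):
        """The owning domain can always read; others need an explicit grant."""
        if consumer != self.name and consumer not in self.grants.get(dataset, set()):
            raise PermissionError(f"{consumer} has no grant on {self.name}.{dataset}")
        return self.datasets[dataset]
```

In this model a hypothetical finance domain could publish an `invoices` dataset, share it with marketing, and leave every other domain locked out by default, which keeps ownership and access decisions with the team that knows the data best.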
Bob predicts this increased adoption of domain expertise will lead to an increase in the development of data apps. “People have data, but they also have knowledge about a business that they want to bring to that data. And that means you’re creating a data app—you’re taking data plus knowledge about the business and building an application that can take autonomous action based on what’s happening within that data. And that’s the next generation of data marketplaces and sharing—because different fields will have expertise defined all over the place.”
This impact will reverberate far beyond the data industry itself.
“If you have a small regional bank, being able to acquire analytic skills from a boutique organization that focuses on that industry is incredible,” Bob said. “You may not have the data scientist capabilities inside your organization, but you can rent them through an organization that’s providing it. I think we’ll see thousands of companies build analytic services that target vertical industries where they have expertise that they can apply and bring to the business. That’s the next generation of data sharing.”
Bottom line: 2022 is the year to focus on solving the problem of productization
As Bob sees it, the modern data stack will continue to open up opportunities for working with data within and across organizations, increasingly relying on teams to think about data as a product. And as more companies adopt a domain-oriented architecture and share or sell data to other businesses, the need for consistent, industry-wide standards of data trust and reliability will become more pressing.
“The modern data stack itself is still relatively nascent,” said Bob. “Whether it’s the observability and performance of data, whether it’s metadata management and governance—there’s still a tremendous amount of opportunity to improve and problems that customers are still trying to solve.”