“The data and AI space moves fast. If you don’t stop and look around once in a while, you just might miss it.”
Wondering what’s next for the future of data engineering and GenAI? Each year, we chat with one of the data industry’s pioneering leaders about their predictions for the modern data stack – and share a few of our own. And this year, we invited our good friend and famed venture capitalist Tomasz Tunguz to share his data engineering predictions for 2024.
As General Partner at Theory Ventures and an investor in MotherDuck, Monte Carlo, and other trailblazing companies, Tomasz Tunguz knows a thing or two about predictions.
We picked some of our favorite Tomasz predictions and a few of our own to give you the ultimate top 10 data engineering trends for 2024.
Ready to see the future? Grab your crystal balls and let’s take a peek!
Pro-tip: if you want the full scoop, be sure to check out Tomasz Tunguz’s talk from IMPACT: The Data Observability Summit.
1. LLMs will transform the stack
This one was a given, but we’re still taking credit.
It’s no exaggeration to say that large language models have transformed the face of technology over the last 12 months. From companies with legitimate use cases to fly-by-night teams with technology on the hunt for a problem, everyone and their data steward is trying to use GenAI in one fashion or another.
And LLMs are set to continue that transformation into 2024 and beyond—from driving increased demand for data and necessitating new architectures like vector databases (“the AI stack”), to changing the way we manipulate and use the data for our end-users.
Automated data analysis and activation will become an expected tool in every product and at every level of the data stack. The question is: how do we make sure these new products are providing real value in 2024 and not just a little new flash for the PR credit?
2. Data teams will look like software teams
The most sophisticated data teams are viewing their data assets as bona fide data products, complete with product requirements, documentation, sprints, and even SLAs for end-users.
So, as organizations begin mapping more and more value to their defined data products, more and more data teams will start looking—and being managed—like the critical product teams that they are.
3. And software teams will become data practitioners
When engineers try to build data products or GenAI without thinking about the data, it doesn’t end well. Just ask UnitedHealthcare.
As AI continues to eat the world, engineering and data will become one and the same. No major software development will enter the market without an eye toward AI, and no major AI will enter the market without some level of real enterprise data powering it.
That means that as engineers seek to elevate new AI products, they’ll need to develop an eye toward the data—and how to work with it—in order to build models that add new and continued value.
4. RAG will be all the RAGe
After a series of high-profile GenAI failures, the need for clean, reliable, and curated context data to augment AI products has become increasingly obvious.
As the AI field continues to develop and blind spots in general training become painfully apparent, teams with proprietary data will turn to RAG and fine-tuning en masse to augment their enterprise AI products and deliver a demonstrable value moat for their stakeholders.
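To make the retrieval step in RAG concrete, here is a minimal sketch of the idea: pick the most relevant piece of your own data and put it in the prompt as context. It assumes a toy word-overlap score in place of a real embedding model and vector database, and the names `DOCS`, `retrieve`, and `build_prompt` are hypothetical, made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical "proprietary" context documents, invented for this sketch.
DOCS = [
    "Refunds are processed within 5 business days of approval.",
    "The orders table is refreshed every hour from the billing system.",
    "Support is available Monday through Friday.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Assemble a prompt that grounds the model in the retrieved context."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The prompt returned by `build_prompt` would then be sent to whichever LLM you use. The quality of the answer depends on the quality of what sits in `DOCS`, which is why clean, curated context data matters so much here.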
5. Teams will operationalize enterprise-ready AI products
The data engineering trend that keeps on trending—data products. And make no mistake, AI is a data product.
If 2023 was the year of AI, 2024 will be the year of operationalizing AI products. Whether out of need or coercion, data teams across industries will embrace enterprise-ready AI products. The question is—will they really be enterprise ready?
Gone are (hopefully) the days of creating random chat features just to say you’re integrating AI when the board asks. In 2024, teams are likely to become more sophisticated about how they develop AI products, leveraging better training practices to create value and identifying problems to solve instead of pumping out technology to create new problems.
6. Data observability will support AI and vector databases
The most common answer? Data quality.
Generative AI is, at its core, a data product. And like any data product, it doesn’t function without reliable data. But at the scale of LLMs, manual monitoring can’t provide the comprehensive and efficient quality coverage required to make any AI reliable.
To truly be successful, data teams need a living, breathing data observability solution tailored to AI stacks that can empower them to detect, resolve, and prevent data downtime consistently within the context of a growing and dynamic environment. Data observability solutions like Monte Carlo that prioritize resolution, pipeline efficiency, and the streaming/vector infrastructures that support AI will be essential in the modern AI reliability battle in 2024.
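For a rough sense of what automated monitoring means in practice, here is a small sketch of two common checks: freshness (has the table been updated recently?) and volume (is today's row count far from the recent norm?). This is an illustration under simple assumptions, not how Monte Carlo or any other product works; the function names and the z-score threshold of 3 are hypothetical choices.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def is_stale(last_updated, now, max_age):
    """Freshness check: True if the data is older than the allowed age."""
    return now - last_updated > max_age


def volume_anomaly(history, today, threshold=3.0):
    """Volume check: True if today's row count deviates from the
    historical mean by more than `threshold` standard deviations."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold


# Example inputs, invented for this sketch.
row_counts = [1000, 1020, 980, 1010, 990]
stale = is_stale(
    last_updated=datetime(2024, 1, 1, 0, 0),
    now=datetime(2024, 1, 1, 3, 0),
    max_age=timedelta(hours=2),
)
dropped = volume_anomaly(row_counts, today=400)
```

Running checks like these by hand across thousands of tables is exactly what doesn't scale, which is the case for automating them.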
7. Big data will get small
Thirty years ago, a personal computer was a novelty. Now, with modern MacBooks boasting the same computational power as the AWS servers Snowflake launched their MVP warehouse on in 2012, hardware is blurring the lines between commercial and enterprise solutions.
Tomasz predicts that since most workloads are small, data teams will begin to use in-process and in-memory databases to analyze and move datasets.
Particularly for teams that need to scale quickly, these solutions are fast to get started and can rise to enterprise-level functionality with commercial cloud offerings.
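To show what "in-process" looks like, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. SQLite is only a stand-in here because it ships with Python; it is not the tool Tomasz named, and analytics-focused in-process engines work along similar lines. The `events` table and its rows are invented for illustration.

```python
import sqlite3


def total_by_user(rows):
    """Load rows into an in-memory database and aggregate them with SQL.

    The database runs inside this Python process: there is no server
    to install, connect to, or pay for.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT user_id, SUM(amount) FROM events "
            "GROUP BY user_id ORDER BY user_id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Example data, invented for this sketch.
totals = total_by_user([(1, 10.0), (1, 5.5), (2, 7.0)])
```

For a small workload, the whole "warehouse" lives and dies with the script, which is what makes this style so quick to get started with.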
8. Right-sizing will take priority
Today’s data leaders are faced with an impossible task. Use more data, create more impact, leverage more AI — but lower those cloud costs.
As Harvard Business Review puts it, chief data and AI officers are set up to fail. As of Q1 2023, IDC reports that cloud infrastructure spending rose to $21.5 billion. According to McKinsey, many companies are seeing cloud spend grow up to 30% each year.
Low-impact approaches like metadata monitoring and tools that allow teams to see and right-size utilization will be invaluable in 2024.
9. The Iceberg will rise (Apache Iceberg)
Apache Iceberg is an open source data lakehouse table format developed by the data engineering team at Netflix to provide a faster and easier way to process large datasets at scale. It’s designed to be easily queryable with SQL even for large analytic tables with petabytes of data.
Where modern data warehouses and lakehouses offer both compute and storage, Iceberg focuses on providing cost-effective, structured storage that can be accessed by the many different engines that may be leveraged across your organization at the same time, like Apache Spark, Trino, Apache Flink, Presto, Apache Hive, and Impala.
Recently, Databricks announced that Delta table metadata will also be compatible with the Iceberg format, and Snowflake has also been moving aggressively to integrate with Iceberg. As the lakehouse becomes a de facto solution for many organizations, Apache Iceberg (and Iceberg alternatives) is likely to continue to grow in popularity as well.
10. Return to office for…someone
RTO—everyone’s least favorite initialism. Or possibly their favorite! Honestly, we can’t keep up at this point. While teams appear to be divided on the issue, more and more teams are being called back to their cubicle/open floor plan/flexible working environments for at least a couple days per week.
According to a September 2023 report by Resume Builder, 90% of companies plan to enforce return-to-office policies by the end of 2024—nearly four years after that fateful spring in 2020. In fact, several powerful CEOs – including Amazon’s Andy Jassy, OpenAI’s Sam Altman, and Google’s Sundar Pichai – have already enacted return-to-office policies over the past several months.
And there do appear to be at least some benefits to working in an office (at least part-time) versus exclusively from home.
Find yourself in the stay-at-home-forever camp? It appears the answer, as is always the case in data, is to deliver more value. Despite recent economic headwinds and their impact on the job market, data and AI teams are in high demand. And employers will often do what it takes to get them and keep them. While some companies are mandating all employees return to the office regardless of role, other companies like Salesforce are requesting that non-remote engineers go in much less, for a total of 10 days per quarter.