Data Lakes vs. Data Warehouses: The Truth Revealed
The data lake vs data warehouse debate is heating up with recent announcements at Snowflake Summit including Apache Iceberg and hybrid tables on one side, and the metadata related announcements at Databrick’s Data + AI around the new Unity Catalog. The old battle lines around “raw vs processed data” or “data engineer vs data scientist” are fading and new differentiators are emerging.
It’s clear this debate isn’t going anywhere, but the technologies are evolving — fast. In the first article in our data platform series, we discussed how to approach building your data platform like a product. Now, we share everything you need to know about the foundation of your data infrastructure: data lake vs data warehouse.
Twenty years ago, your data warehouse probably wouldn’t have been voted hottest technology on the block. These bastions of the office basement were long associated with siloed data workflows, on-premises computing clusters, and a limited set of business-related tasks (i.e., processing payroll, and storing internal documents).
Now, with the rise of data-driven analytics, cross-functional data teams, and most importantly, the cloud, the terms “modern data warehouse” or data lake are nearly analogous with agility and innovation. In many ways, the cloud makes data easier to manage, more accessible to a wider variety of users, and far faster to process. Companies literally can’t use data in a meaningful way without the a data lake vs data warehouse discussion.
When it comes to selecting between a data lake vs data warehouse for your data platform, however, the answer isn’t as straightforward. With the release of Amazon Redshift in 2013 followed by Snowflake, Google BigQuery, and others in the subsequent years, the market has become increasingly hot. Add data lakes such as S3 or Databricks to the mix, and the decision between data lake vs data warehouse becomes that much harder.
Whether you’re just getting started or are in the process of re-assessing your existing big data solution, here’s everything you need to know to choose the right data lake or data warehouse for your data stack:
What is data warehouse?
A data warehouse is a data repository that provides data storage and compute, usually leveraging SQL queries for data analytics use cases.
What is a data lake?
A data lake is a data repository that provides storage and compute for structured and unstructured data, oftentimes for streaming, machine learning, or data science use cases.
Data lake vs data warehouse: 3 key differences
Data lakes and data warehouses are both data storage repositories. Three key differences between a data warehouse and a data lake are how they provide storage, compute power, and metadata (contextual information about the data in your ecosystem).
- Metadata: Data warehouses and data lakes typically offer a way to manage and track all the databases, schemas, and tables that you create. These objects are often accompanied by additional information such as schema, data types, user-generated descriptions, or even freshness and other statistics about the data.
- Storage: Storage refers to the way in which the data warehouse and data lake physically store all the records that exist across all tables. By leveraging various kinds of storage technologies and data formats, data warehouses and data lakes can serve a wide range of use cases with desired cost and performance characteristics. Traditionally, a data lake stores raw structured, semi-structured and unstructured data without constraints, while data warehouses stores data in an orderly fashion with their corresponding metadata. These differences have converged over time (see the Lakehouse section below) as Databricks has enabled users to add structure and metadata via the Unity Catalog and Delta Lake while Snowflake introduced Apache Iceberg tables to bring the reliability and simplicity of SQL tables, while making it possible for engines like Apache Spark, Trino, Apache Flink, Presto, and Hive to safely work with the same tables, at the same time.
- Compute: Compute refers to the way in which the data warehouse or data lake perform calculations on the data records it stores. This is the engine that allows users to “query” data, ingest data, transform it – and more broadly, extract value from it. Frequently, these calculations are expressed via SQL. This is another area where the data lake vs data warehouse battle lines are starting to overlap. Snowflake’s Snowpark supports multiple programming languages such as Java, Python or Scala which are then executed as SQL functions. They also recently introduced Snowpark Python, a native Python experience with a pandas and PySpark-like API for data manipulation without the need to write verbose SQL. On the other side, Spark SQL can help turn languages like Python, R, and Scala into SQL commands.
Why do I need a data warehouse?
Data warehouses are fully integrated and managed solutions, making them simple to build and operate out-of-the-box. When using a data warehouse, you typically use metadata, storage and compute from a single solution, built and operated by a single vendor.
In your data lake vs data warehouse discussion consider data warehouses typically require more structure and schema, which often forces better data hygiene and results in less complexity when reading and consuming data.
Owing to its pre-packaged functionalities and strong support for SQL, data warehouses facilitate fast, actionable querying, making them great for data analytics teams.
Common data warehouse technologies include:
- Amazon Redshift: The first widely popular (and readily available) cloud data warehouse, Amazon Redshift sits on top of Amazon Web Services (AWS) and leverages source connectors to pipe data from raw data sources into relational storage. Redshift’s columnar storage structure and parallel processing makes it ideal for analytic workloads.
- Google BigQuery: Like Redshift, Google BigQuery leverages its mothership’s proprietary cloud platform (Google Cloud), uses a columnar storage format, and takes advantage of parallel processing for quick querying. Unlike Redshift, BigQuery is a serverless solution that scales according to usage patterns.
- Snowflake: Unlike Redshift or GCP which rely on their proprietary clouds to operate, Snowflake’s cloud data warehousing capabilities are powered by AWS, Google, Azure, and other public cloud infrastructure. Unlike Redshift, Snowflake allows users to pay separate fees for compute and storage, making the data warehouse a great option for teams looking for a more flexible pay structure.
Why do I need a data lake?
In the data lake vs data warehouse debate, consider data lakes are the do-it-yourself version of a data warehouse, allowing data engineering teams to pick and choose the various metadata, storage, and compute technologies they want to use depending on the needs of their systems. Common data lake technologies include:
- Metadata: Hive, Amazon Glue, Databricks.
- Storage: S3, Google Cloud Storage, Microsoft Azure Blob Storage, Hadoop HDFS.
- Compute: Apache Pig, Hive, Presto, Spark.
- Common formats: JSON, Apache Parquet, Apache Avro, Apache Hudl, Delta Lake.
Data lakes are ideal for data teams and data scientists looking to build a more customized platform, often supported by a handful (or more) of data engineers.
Some common features of data lakes include:
- Decoupled storage and compute: Not only can this functionality allow for substantial cost savings, but it also facilitates parsing and enriching of the data for real-time streaming and querying.
- Support for distributed compute: Distributed computing helps support the performance of large-scale data processing because it allows for better segmented query performance, more fault-tolerant design, and superior parallel data processing.
- Customization and interoperability: Owing to their “plug and chug” nature, data lakes support data platform scalability by making it easy for different elements of your stack to play well together as the data needs of your company evolve and mature.
- Largely built on open source technologies: This facilitates reduced vendor lock-in, and affords great customization, which works well for companies with large data engineering teams.
- Ability to handle unstructured or weakly structured data: Data lakes can support raw data, meaning that you have greater flexibility when it comes to working with your data, ideal for data scientists and data engineers. Working with raw data gives you more control over your aggregates and calculations.
- Supports sophisticated non-SQL programming models: This is a point of differentiation in the data lake vs data warehouse debate (but for how much longer is anyone’s guess). Unlike most data warehouses, data lakes support Apache Hadoop, Apache Spark, PySpark, and other frameworks for advanced data science and machine learning.
It’s important to note that many data warehouse solutions, including Snowflake and BigQuery, can support some of the above functionalities, which leads us to our next data lake vs data warehouse point…
Wait, there’s more! What is a data lakehouse?
Just when you thought the data lake vs data warehouse decision was tough enough, another data warehousing option has emerged as an increasingly popular one, particularly among data engineering teams.
Meet the data lakehouse, a solution that marries features of both data warehouses and data lakes, and as a result, combines traditional data analytics technologies with those built for more advanced computations such as machine learning.
Data lakehouses first came onto the scene when cloud warehouse providers began adding features that offer lake-style benefits, such as Redshift Spectrum or Delta Lake. Similarly, data lakes have been adding technologies that offer warehouse-style features, such as SQL functionality and schema. Today, the historical differences in the data lake vs warehouse discussion are narrowing so you can access the best of both words in one package.
The following functionalities are helping data lakehouses further blur the lines between the two technologies:
- High-performance SQL: technologies like Presto and Spark provide SQL interface at close to interactive speeds over data lakes. This opened the possibility of data lakes serving analysis and exploratory needs directly, without requiring summarization and ETL into traditional data warehouses.
- Schema: file formats like Parquet introduced more rigid schema to data lake tables, as well as a columnar format for greater query efficiency.
- Atomicity, Consistency, Isolation, and Durability (ACID): Data lake technologies like Delta Lake and Apache Hudi introduced greater reliability in write/read transactions, and takes lakes a step closer to the highly desirable ACID properties that are standard in traditional database technologies.
- Managed services: for teams that want to reduce the operational lift associated with building and running a data lake, cloud providers offer a variety of managed lake services. For example, Databricks offers a managed version of Apache Hive, Delta Lake, and Apache Spark while Amazon Athena offers a fully managed lake SQL query engine and Amazon’s Glue offers a fully managed metadata service.
With the rise of real-time data aggregation and streaming to inform lightspeed analytics (think Silicon Valley tech giant speeds: Uber, DoorDash, and Airbnb), data lakehouses are likely to rise in popularity and relevance for data teams across industries in the coming years.
So, data lake vs data warehouse: What should you choose?
There’s not an easy answer to the data lake vs data warehouse question. In fact, it’s no surprise that data teams frequently migrate from one data warehouse solution to another as the needs of their data organization shifts and evolves to meet the demands of data consumers (which nowadays, is nearly every functional area in the business, from Marketing and Sales to Operations and HR).
While data warehouses often make sense for data platforms whose primary use case is for data analysis and reporting, data lakes are becoming increasingly user-friendly, particularly via managed data lakehouse solutions like Dremio and open source projects like Delta Lake.
Increasingly, we’re finding that data teams are unwilling to settle on just one solution in the data lake vs data warehouse debate – and for good reason. As more use cases emerge and more stakeholders (with differing skill sets!) are involved, it is almost impossible for a single solution to serve all needs.
One data leader we talked to at a 5,000-person ride-sharing company told us that even though his data engineering team was adamant that they wanted to build a data lake, they ended up putting in place an in-house system for reporting, access control, and data quality that turned the final product into more of a data warehouse.
We find that regardless of what you choose in the data lake vs data warehouse decision, it’s important to apply the following best practices:
Align on the solution(s) that map to your company’s data goals.
If your company only uses one or two key data sources on a regular basis for a select few workflows, then it might not make sense to build a data lake from scratch, both in terms of time and resources. But if your company is trying to use data to inform everything under the sun, then a hybrid warehouse-lake solution may just be your ticket to fast, actionable insights for users across roles.
Know who your core users will be.
Will the primary users of your data platform be your company’s business intelligence team, distributed across several different functions? What about a dedicated team of data engineers? Or a few groups of data scientists running A/B tests with various data sets? All of the above? Regardless, choose the data lake vs data warehouse option that makes the most sense for the skill sets and needs of your users.
Don’t forget data observability.
Data warehouse, data lake, data lakehouse: it doesn’t matter. All three solutions (and any combination of them) will require a holistic approach to data governance and data quality. After all, your data platform is only as powerful and reliable as the data that informs it; if your data is broken, missing, or otherwise inaccurate (we call this problem data downtime), it doesn’t matter how advanced your pipelines are.
Your thoughtful investment in the latest and greatest data warehouse doesn’t matter if you can’t trust your data. To address this problem, some of the best data teams are leveraging data observability, an end-to-end approach to monitoring and alerting for issues in your data pipelines. More on that in a future article.
I’m excited to see where the data industry and the data lake vs data warehouse discussion is headed. I predict that a mature data stack will likely include more than one solution, and data organizations will ultimately benefit from greater cost savings, agility, and innovation.
At the end of the day, it’s not so much about choosing one tool or the other as it is about picking the right tool (or tools) for the job.
If you’re interested in building a better data platform or want to chat about data lake vs data warehouse, reach out to Lior Gavish and book a time to speak with us in the form below.