The term ‘data lakehouse’ entered the data and analytics lexicon over the last few years. Often uttered flippantly to describe the theoretical combination of data warehouse and data lake functionality, the term came into more serious and more widespread use in early 2020 as Databricks adopted it to describe its approach of marrying the data structure and data management features of the data warehouse with the low-cost storage used for data lakes.
Databricks was not the first to start using the data lakehouse terminology, however. Amazon Web Services (AWS) previously used the term (or in its case, ‘lake house’) in late 2019 in relation to Amazon Redshift Spectrum, its service that enables users of its Amazon Redshift data warehouse service to apply queries to data stored in its Amazon S3 cloud service.
In fact, the first vendor use of the term we have found can be attributed to Snowflake, which in late 2017 promoted the fact that its customer Jellyvision was using Snowflake to combine schemaless and structured data processing in what Jellyvision described as a data lakehouse.
While Snowflake’s marketing has not run with the lakehouse terminology, preferring the term ‘data cloud’ to describe its ability to support multiple data processing and analytics workloads, AWS has very much picked up on it as a term to describe its combined portfolio of data and analytics services – placing its ‘lake house architecture’ front and center of its data and analytics announcements at re:Invent 2020.
Since a quick internet search returns nearly twice as many results for ‘data lakehouse’ as for ‘data lake house,’ we will continue to use the former from this point on, unless specifically referring to AWS’s ‘lake house architecture.’ Either way, it is worth exploring the term, and the products and services it is being applied to, in more detail.
So what exactly is a data lakehouse?
As noted above, the simplest description of a data lakehouse is that it is an environment designed to combine the data structure and data management features of a data warehouse with the low-cost storage of a data lake. As we noted in July when we examined Databricks’ evolving strategy, we see wisdom in the desire to bring the structured analytics advantages of data warehousing to data stored in low-cost cloud-based data lakes, especially for data types and workloads that do not lend themselves naturally to relational databases.
While the data lake concept promises a more agile environment than a traditional data warehouse approach – one that is better suited to changing data and evolving business use cases – early initiatives suffered from a lack of appropriate data engineering processes and data management and governance functionality, making them relatively inaccessible for general business or self-service users.
The data lakehouse blurs the line between data lakes and data warehouses. It maintains the cost and flexibility advantages of persisting data in cloud storage, while enabling schema to be enforced on curated subsets of that data (in specific conceptual zones of the data lake, or in an associated analytic database) in order to accelerate analysis and business decision-making.
One of the key enablers of the lakehouse concept is a structured transactional layer. Databricks added this capability to its Unified Analytics Platform (which provides Spark-based data processing for data in AWS or Microsoft Azure cloud storage) in April 2019 with the launch of Delta Lake.
Now an open source project of the Linux Foundation, Delta Lake provides a structured transactional layer with support for ACID (atomic, consistent, isolated, durable) transactions, updates and deletes, and schema enforcement. Another project that we have seen increasing interest in is The Apache Software Foundation’s Apache Iceberg, which provides a table format for large volumes of data that supports snapshots and schema evolution.
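Conceptually, a structured transactional layer of this kind works by recording every change to a table in an append-only commit log and reconstructing table state by replaying that log, which is what makes commits atomic, old versions recoverable as snapshots, and schema enforceable at write time. The toy sketch below illustrates that idea only; the class and method names are hypothetical, and this is not how Delta Lake or Apache Iceberg is actually implemented:

```python
class TransactionalTable:
    """Toy, in-memory sketch of a transaction-log-based table.

    Changes are never applied in place; each operation appends one
    immutable commit to an ordered log, and readers materialize the
    table by replaying the log up to a chosen version.
    """

    def __init__(self, schema):
        self.schema = schema  # e.g. {"id": int, "name": str}
        self.log = []         # append-only list of commits

    def _validate(self, rows):
        # Schema enforcement: reject rows that do not match the declared
        # columns and types, rather than silently ingesting them.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: {row}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"bad type for column {col!r}")

    def insert(self, rows):
        self._validate(rows)
        # Atomicity (in this toy model): a single log append either
        # happens entirely or not at all.
        self.log.append(("insert", list(rows)))

    def delete(self, column, value):
        self.log.append(("delete", (column, value)))

    def snapshot(self, version=None):
        """Materialize table state by replaying the first `version`
        commits (all of them if `version` is None) -- 'time travel'."""
        commits = self.log if version is None else self.log[:version]
        rows = []
        for action, payload in commits:
            if action == "insert":
                rows.extend(payload)
            else:
                column, value = payload
                rows = [r for r in rows if r[column] != value]
        return rows
```

In this model, a delete does not rewrite earlier commits; it simply adds a new one, so `snapshot(1)` can still reproduce the table as it stood before the delete, and a row with the wrong types is rejected at insert time.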
Databricks also boosted its data lakehouse portfolio in July with the delivery of Delta Engine, a complementary high-performance engine for query acceleration. More recently, it added SQL Analytics, which is built on Delta Lake and the analytics dashboarding functionality Databricks acquired with Redash.
AWS also made a couple of significant announcements at re:Invent concerning new lake house architecture functionality, in addition to evolving its positioning of the term to describe a combination of the Amazon S3 cloud storage and Amazon Athena interactive query services, the AWS Glue data integration service, and AWS Lake Formation.
Specifically, AWS announced the preview release of AWS Lake Formation transactions, which will allow multiple users to concurrently insert, delete and modify data stored in governed tables – a new Amazon S3 table type that supports ACID transactions. AWS also announced the gated preview release of AWS Glue Elastic Views, a new capability of AWS Glue that enables users to create materialized views to combine and replicate data across multiple data stores (initially Amazon DynamoDB, Amazon S3, Amazon Redshift and Amazon Elasticsearch Service, with Amazon RDS, Amazon Aurora and others to follow).
In addition to Databricks and AWS, there are several other vendors that could be said to offer a data lakehouse, even if they do not use the term.
For example, if the distinguishing feature of the lakehouse is that it combines the data structure and data management features of a data warehouse with the low-cost storage of a data lake, then a case could be made for Google having pioneered the concept with its Dremel research project, which provided an ad hoc SQL query system to analyze data stored in Google’s Colossus file system (the successor to Google File System). Dremel was later commercialized as Google BigQuery.
Similarly, while Microsoft does not describe its Azure Synapse Analytics as a lakehouse, it could certainly be considered to fit the bill with its combination of the former Azure SQL Data Warehouse functionality with big-data processing (Apache Spark), data integration tooling and the ability to leverage Azure Data Lake Storage as a common storage layer.
IBM could also be said to be delivering a lakehouse (even if the company wouldn’t describe it as one), since its IBM Cloud Data Lake Reference Architecture involves a combination of interactive query (IBM SQL Query), data warehousing (IBM Db2 Warehouse) and analytics (Watson Studio and Watson Machine Learning), as well as IBM Object Storage (among other things).
Cloudera has never been a fan of the term ‘data lake,’ let alone ‘lakehouse,’ preferring ‘enterprise data cloud,’ so we don’t expect it to jump aboard the lakehouse bandwagon. Either way, the company already offers a combination of structured data warehousing, machine learning and other services to enable the management and analysis of data in object storage (both on-premises and in the cloud).
As noted above, while Snowflake has promoted customers that claim to use its offering as a lakehouse, the company itself prefers ‘data cloud.’ Snowflake is certainly looking to position itself as more than just a data-warehousing provider, however. It recently talked up its data lake credentials by adding early preview support for the processing and analysis of unstructured data to its existing native support for semi-structured data, as well as support for external tables, which enable the analysis of semi-structured and unstructured data held in external resources.
The 451 Take
Love it or hate it (and a quick internet search will show that there are plenty of data-warehousing aficionados who hate it), the concept of the lakehouse appears to be here to stay. The ecosystem of vendors offering functionality that fits the general description is growing steadily. Not all of them are using the term, however, and those that are define (and spell) it differently, so there is scope for confusion. Overall, however, the trend toward using cloud object storage services as a data lake to store large volumes of structured, semi-structured and unstructured data is not going to diminish anytime soon, and there are clear performance and efficiency advantages in bringing structured data processing concepts and functionality to that data, rather than having to export it into external data-warehousing environments for analysis. Whether you consider the lakehouse a truly new architecture or a natural evolution of the data lake, the addition of support for ACID transactions, updates and deletes brings this theoretical concept closer to reality in 2021.