🧠 Second Brain



Data Lakehouse

Last updated Jun 21, 2024

A data lakehouse is a data management architecture that combines the flexibility, cost-efficiency, and scalability of Data Lakes with the robust data management features and ACID Transactions typical of Data Warehouses. It relies on a Data Lake Table Format to enable Business Intelligence and Machine Learning on the same comprehensive data sets.
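To make the "database features on top of files" idea concrete, here is a minimal, purely illustrative sketch of the core trick behind table formats: data files sit on cheap storage, and a separate commit log, published via an atomic rename, decides which files belong to the table. All class and file names here are hypothetical; real formats additionally handle concurrency control, schema evolution, time travel, and much more.

```python
import json
import os
import tempfile

class TinyTable:
    """Toy table format: plain data files plus a numbered commit log.

    A commit is visible only once its log entry exists, so readers see
    either a whole commit or nothing (a crude ACID-style guarantee).
    """

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _log_entries(self):
        names = [n for n in os.listdir(self.log_dir) if n.endswith(".json")]
        return sorted(names, key=lambda n: int(n.split(".")[0]))

    def commit(self, rows):
        """Write a data file, then atomically publish it in the log."""
        version = len(self._log_entries())
        data_name = f"part-{version}.json"
        with open(os.path.join(self.root, data_name), "w") as f:
            json.dump(rows, f)
        # Write the log entry to a temp name first; os.rename is atomic,
        # so a half-written commit can never become visible to readers.
        tmp = os.path.join(self.log_dir, f"{version}.json.tmp")
        with open(tmp, "w") as f:
            json.dump({"version": version, "add": data_name}, f)
        os.rename(tmp, os.path.join(self.log_dir, f"{version}.json"))
        return version

    def read(self):
        """Replay the log and load only the data files it references."""
        rows = []
        for name in self._log_entries():
            with open(os.path.join(self.log_dir, name)) as f:
                entry = json.load(f)
            with open(os.path.join(self.root, entry["add"])) as f:
                rows.extend(json.load(f))
        return rows

root = tempfile.mkdtemp()
table = TinyTable(root)
table.commit([{"id": 1, "city": "Zurich"}])
table.commit([{"id": 2, "city": "Oslo"}])
print(len(table.read()))  # → 2
```

The point of the sketch: the warehouse-like guarantees come entirely from the log layer, while the data itself stays as open files on lake storage.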

The term “data lakehouse” was first introduced by Jeremy Engle in these slides in July 2017. Its significance surged when Databricks published its CIDR whitepaper in 2021; notably, they referred to it simply as “lakehouse,” omitting “data.” The initial concept has since evolved, especially as Data Lake Table Formats now incorporate database features atop distributed file systems.

For a detailed comparison, see Lakehouse vs Data Warehouse.

From Emerging Architectures for Modern Data Infrastructure - a16z:

“There is growing recognition and clarity around the lakehouse architecture. Supported by a range of vendors, including AWS, Databricks, Google Cloud, Starburst, and Dremio, as well as data warehouse pioneers, the lakehouse combines a robust storage layer with powerful data processing engines like Spark, Presto, Apache Druid/Clickhouse, Python libraries, and more.”

# Data Lakehouse Platform

The Databricks Lakehouse Platform merges the best aspects of data lakes and warehouses, offering the reliability, robust Data Governance, and performance of warehouses alongside the openness and flexibility of lakes. This unified approach streamlines your data stack, reducing silos across data engineering, analytics, BI, data science, and machine learning. Built on open source and standards, it ensures maximum flexibility, with a common approach to data management, security, and governance that enhances efficiency and innovation.

# Databricks Lakehouse Use-Cases

Let’s delve into some Delta Lake on Databricks examples, highlighting three key use cases for a Data Lakehouse:

ℹ️ For comprehensive data governance, integrate Airbyte with Databricks Unity Catalog - Databricks. This enables row-level security, secure data sharing with Delta Sharing, a built-in data catalog, enhanced data search, and more.

# Open Source Data Lakehouse

Explore the concept further in Building a serverless Data Lakehouse from spare parts.

# Architecture

Different components of a Lakehouse:
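As a rough orientation, the layers commonly described in this note — file storage, an open table format, governance/catalog, and query engines — can be sketched as follows. This is a hypothetical, illustrative breakdown, not an exhaustive taxonomy, and the exact split varies by vendor.

```python
# Illustrative lakehouse layer stack, bottom to top. The example
# technologies are those mentioned in this note plus common peers.
LAKEHOUSE_LAYERS = [
    {"layer": "storage", "examples": ["S3", "GCS", "ADLS", "HDFS"]},
    {"layer": "table format", "examples": ["Delta Lake", "Apache Iceberg", "Apache Hudi"]},
    {"layer": "catalog / governance", "examples": ["Unity Catalog"]},
    {"layer": "query engine", "examples": ["Spark", "Presto", "Dremio"]},
    {"layer": "consumers", "examples": ["Business Intelligence", "Machine Learning"]},
]

for item in LAKEHOUSE_LAYERS:
    print(f"{item['layer']}: {', '.join(item['examples'])}")
```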

# History

Data Engineering Whitepapers

References: Lakehouse AMA
Last Modified: 2021-10-24
Created: 2021-10-24