Data Lake Table Formats (Open Table Formats)
Prominent table formats include Delta Lake, Apache Iceberg, and Apache Hudi.
Data lake table formats provide database-like features on top of distributed File Formats. Similar to a traditional table, these formats consolidate distributed files into a single table, simplifying management. Consider them an abstraction layer that structures your physical data files into coherent tables.
# Why Open Table Formats
Table Formats are a key part of a Data Platform. If you use a Closed-Source Data Platform but open table formats, you still own the data and are not as locked in. This means you can move the data; or maybe you don’t even need to if it already lives in S3, and you simply switch the compute engine to another vendor or an open-source one.
That’s why Microsoft Fabric uses and pushes Delta Lake under the hood, and why Snowflake integrated Iceberg Tables, which are essentially Apache Iceberg tables.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontFamily': 'arial', 'fontSize': '16px'}}}%%
flowchart TD
    subgraph Open ["Open Data Architecture"]
        direction TB
        subgraph O [" "]
            direction TB
            O2[Open Compute]
            O4["Open Table Format"]
            O5[File Format]
            O6[Storage]
            O2 --> O4
            O4 --> O5
            O5 --> O6
        end
    end
    subgraph Closed ["Typical Closed Data Architecture"]
        direction TB
        subgraph C [" "]
            direction TB
            C2[Vendor Compute]
            C4[Proprietary Table Format]
            C5[Proprietary File Format]
            C6[Storage]
            C2 --> C4
            C4 --> C5
            C5 --> C6
        end
    end
    %% SWAP connection
    C4 ---|SWAP| O4
    %% Styling
    classDef subgraphStyle fill:#fff,stroke:#333,stroke-width:2px;
    classDef nodeStyle fill:#f9f9f9,stroke:#666,stroke-width:1px,rx:5px;
    classDef locked fill:#ffecec,stroke:#ff9999;
    classDef unlocked fill:#e6ffe6,stroke:#99cc99;
    classDef neutral fill:#ffffff,stroke:#666;
    class Open,Closed subgraphStyle;
    class O2,C2 nodeStyle;
    class O4 unlocked;
    class C4,C5 locked;
    class O5,O6,C6 neutral;
```
^6dfb4f
Inspired by Open Table Formats and the Open Data Lakehouse, In Perspective.
# Market Updates
- Databricks Data & AI Summit 2024: Databricks announced the acquisition of Tabular, the company behind Apache Iceberg, as well as the open-sourcing of Unity Catalog.
- Snowflake announced Polaris Catalog to integrate additional REST APIs to Iceberg and Snowflake.
- XTable (formerly OneTable): an omnidirectional converter between table formats, similar to Databricks’ Project UniForm / Delta Universal Format (UniForm).
- Snowflake introduced Iceberg tables at their 2022 summit. (Delve into how Iceberg synergizes with Snowflake, particularly its open-source aspects.)
- At the Databricks Data & AI Summit 2022, Databricks declared Delta Lake fully open-source, including all features in Delta Lake 2.0 (e.g., Z-ordering, optimization). This solidified its position as a leading format, further enhanced by the open-source sharing feature Delta Sharing and Delta Live Table. They are also developing an open-source Market Place - Databricks.
Comment on Reddit.
# General Features
- DML and SQL Support: Inserts, Upserts, Deletes.
- Provides merging, updating, and deleting directly on distributed files.
- Some formats also support Scala/Java and Python APIs in addition to SQL.
- Backward compatible with Schema Evolution and Enforcement
- Automatic Schema Evolution is a key feature of Table Formats, as changing schemas is still a challenging task in today’s data engineering work. Evolution here means we can add new columns, or even widen some types, without breaking anything. You can also rename or reorder columns, although that may break backward compatibility; still, we change one table and the Table Format takes care of updating it across all distributed files. Best of all, it does not require a rewrite of your table and underlying files.
- ACID Transactions, Rollback, Concurrency Control (e.g., Delta has Optimistic Concurrency)
- A transaction is designed to either commit all changes or roll back entirely, ensuring you never end up in an inconsistent state.
- Integrated with various cluster-computing frameworks (e.g., Iceberg with Apache Spark, Trino, Flink, Presto, Apache Hive, Impala; Hudi with Apache Spark, Presto, Trino, Hive; and Delta with Spark, Presto, Trino, Athena).
- Time Travel, Audit History with Transaction Log (Delta Lake) and Rollback
- Time travel allows the Table Format to version the big data stored in your data lake, enabling access to any historical version of that data. This simplifies data management, makes auditing easy, allows rolling back data in case of accidental bad writes or deletes, and helps reproduce experiments and reports.
- All formats assist with GDPR compliance.
- The Transaction Log is an ordered record of every transaction performed on a table since its inception. For example, Delta Lake creates a single folder called `_delta_log` (details in Delta Lake). This log is a common component across many of its features, including ACID transactions, scalable metadata handling, and time travel.
- Scalable Metadata Handling: These table formats not only handle large amounts of data and big files, they also manage metadata at scale with automatic checkpointing and summarization.
- ℹ️ Helpful for GDPR compliance.
- Time-travel: Enables reproducible queries using the exact same table snapshot, or lets users easily examine changes. Version rollback allows for quick correction of problems by resetting tables to a good state.
- Avoids the need to implement complex Slowly Changing Dimensions (Type 2); with all transactions recorded in the transaction log and each version handy, you can extract changes as you would with Change Data Capture (CDC) (if done often enough not to lose intermediate changes).
- Partitioning / Partition Evolution
- These formats handle the tedious and error-prone task of producing partition values for rows in a table and automatically skip unnecessary partitions and files. No extra filters are needed for fast queries, and the table layout can be updated as data or queries change.
- File Sizing, Data Clustering with Compaction
- In Delta Lake, data can be compacted with OPTIMIZE - Delta Lake, and old versions can be deleted by setting a retention period with VACUUM.
- Data compaction is supported out-of-the-box, offering different rewrite strategies such as bin-packing or sorting to optimize file layout and size.
- Unified Batch and Streaming Source and Sink (eliminating the need for Lambda Architecture)
- Supports streaming ingestion and built-in CDC sources & tools (Hudi).
- It’s advantageous that it doesn’t matter whether you’re reading from a stream or a batch: Delta supports both with a single API and target sink. This is well explained in Beyond Lambda: Introducing Delta Architecture or through code examples. The often-used MERGE statement from SQL can be applied to your distributed files as well with Delta, including schema evolution and ACID transactions.
- Data Sharing
- For example, Delta Sharing: An open protocol for secure data sharing, making it simple to share data with other organizations regardless of the computing platforms they use.
- Change Data Feed (CDF)
- The CDF feature enables tables to track row-level changes between versions. When enabled, it records “change events” for all data written into the table, including row data and metadata indicating whether the row was inserted, deleted, or updated. Currently, this is supported mainly by Delta.
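The transaction-log and time-travel mechanics described above can be sketched with a toy in-memory log. This is a deliberate simplification of the idea behind Delta’s `_delta_log` (state derived by replaying ordered, immutable commits), not any real implementation:

```python
import json

class ToyTable:
    """Toy table whose state is derived by replaying an ordered transaction
    log, mimicking the role a folder like _delta_log plays."""

    def __init__(self):
        self.log = []  # ordered, append-only list of JSON commit records

    def commit(self, action, rows):
        # Each commit appends an immutable record; nothing is overwritten.
        self.log.append(json.dumps({"action": action, "rows": rows}))

    def snapshot(self, version=None):
        # Replay the log up to `version` to reconstruct that table state.
        end = len(self.log) if version is None else version + 1
        state = {}
        for entry in self.log[:end]:
            record = json.loads(entry)
            if record["action"] == "add":
                state.update(record["rows"])
            elif record["action"] == "delete":
                for key in record["rows"]:
                    state.pop(key, None)
        return state

table = ToyTable()
table.commit("add", {"id1": "alice", "id2": "bob"})  # version 0
table.commit("delete", ["id2"])                      # version 1

assert table.snapshot() == {"id1": "alice"}                          # latest
assert table.snapshot(version=0) == {"id1": "alice", "id2": "bob"}   # time travel
```

Because every version is just a prefix of the log, rollback is simply "make an old snapshot the new latest", and auditing is reading the log itself.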
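Optimistic concurrency, mentioned above for Delta, amounts to a compare-and-swap on the log’s version number: a writer reads a snapshot version, prepares its changes, and the commit only succeeds if no one else advanced the log in the meantime. A minimal sketch (toy protocol, not Delta’s actual conflict-resolution rules):

```python
class OptimisticLog:
    """Toy optimistic concurrency control: a commit only succeeds if the log
    has not advanced since the writer read its snapshot version."""

    def __init__(self):
        self.commits = []

    def current_version(self):
        return len(self.commits) - 1  # -1 means "empty table"

    def try_commit(self, read_version, payload):
        # Compare-and-swap on the version number.
        if read_version != self.current_version():
            return False  # conflict: someone committed first; caller retries
        self.commits.append(payload)
        return True

log = OptimisticLog()
v = log.current_version()                        # both writers read here
assert log.try_commit(v, "writer A") is True     # A wins
assert log.try_commit(v, "writer B") is False    # B's read version is stale
assert log.try_commit(log.current_version(), "writer B retry") is True
```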
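The "no rewrite needed" property of schema evolution can be illustrated with a toy table where adding a column is a metadata-only change and old files are projected onto the current schema at read time (a sketch of the idea, not any format’s actual reader):

```python
class EvolvingTable:
    """Toy schema evolution: adding a column only updates table metadata;
    existing 'files' (row batches) are never rewritten, and missing values
    are filled with None at read time."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.files = []  # each 'file' is a batch of rows written under some schema

    def add_column(self, name):
        self.columns.append(name)  # metadata-only change, no file rewrite

    def write(self, rows):
        self.files.append(rows)

    def read(self):
        # Project every file onto the *current* schema.
        return [{c: row.get(c) for c in self.columns}
                for batch in self.files for row in batch]

t = EvolvingTable(["id", "name"])
t.write([{"id": 1, "name": "alice"}])
t.add_column("email")  # old file stays untouched on disk
t.write([{"id": 2, "name": "bob", "email": "bob@example.com"}])

assert t.read()[0] == {"id": 1, "name": "alice", "email": None}
```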
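The automatic file and partition skipping mentioned above works because the formats keep per-file column statistics (min/max) in their metadata, so the engine can rule out files without opening them. A toy sketch of that pruning step:

```python
def prune_files(files, column, value):
    """Toy file skipping: keep only files whose min/max range for `column`
    can contain `value`; everything else is skipped without being opened."""
    return [f for f in files
            if f["stats"][column]["min"] <= value <= f["stats"][column]["max"]]

files = [
    {"path": "part-0.parquet", "stats": {"id": {"min": 1,   "max": 100}}},
    {"path": "part-1.parquet", "stats": {"id": {"min": 101, "max": 200}}},
    {"path": "part-2.parquet", "stats": {"id": {"min": 201, "max": 300}}},
]

# A point lookup for id = 150 touches one file out of three:
assert [f["path"] for f in prune_files(files, "id", 150)] == ["part-1.parquet"]
```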
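Bin-packing, one of the compaction strategies named above, can be sketched as greedily grouping small files toward a target size so each group is rewritten as one larger file (a toy heuristic; real engines also consider sort order, partitions, and open transactions):

```python
def bin_pack(file_sizes, target):
    """Toy bin-packing compaction: greedily group files into groups of at
    most `target` bytes; each group would then be rewritten as one file."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Six small files compacted toward a 128 MB target (sizes in MB here):
groups = bin_pack([10, 20, 30, 60, 50, 90], target=128)
assert all(sum(g) <= 128 for g in groups)
assert sum(len(g) for g in groups) == 6  # every file lands in exactly one group
```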
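The MERGE (upsert) pattern and CDF-style row-level change events fit together naturally: the merge decides per key whether a row is an insert or an update, and that decision is exactly the "change event" CDF records. A toy sketch of the combined behavior (not Delta’s actual API):

```python
def merge_with_cdf(target, source, key="id"):
    """Toy MERGE (upsert) that also records row-level change events, in the
    spirit of a MERGE statement combined with a Change Data Feed."""
    result = {row[key]: row for row in target}
    changes = []
    for row in source:
        change_type = "update" if row[key] in result else "insert"
        changes.append({"change_type": change_type, "row": row})
        result[row[key]] = row
    return list(result.values()), changes

target = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
source = [{"id": 2, "name": "bobby"}, {"id": 3, "name": "carol"}]
merged, changes = merge_with_cdf(target, source)

assert {r["id"]: r["name"] for r in merged} == {1: "alice", 2: "bobby", 3: "carol"}
assert [c["change_type"] for c in changes] == ["update", "insert"]
```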
# Comparisons: Hudi, Iceberg, Delta
A detailed comparison of these formats is available in Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake).
Typically, Parquet’s binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats such as Avro or ORC. Below is a chart that shows which file formats are allowed to make up the data files of a table.
# Format Conversion
Exploring tools like Delta Universal Format (UniForm) and XTable can facilitate format transitions.
# History
The History and Evolution of Open Table Formats - Part I
# Additional Resources
- Onehouse’s Feature Comparison
- Update log
- 8/11/22 - Original publish date
- 1/11/23 - Refresh feature comparisons, added community stats + benchmarks
- 1/12/23 - Databricks contributed a few minor corrections
- 10/31/23 - Minor edits
- 1/31/24 - Minor update about current state of OneTable
- Dremio’s Comparison
- LakeFS’s Overview
- The ultimate guide to table format internals - all my writing so far — Jack Vanlightly
Further insights are available in Use transactional processing and my Data Lake/Lakehouse Guide, where I wrote about this in more detail.
# Composable Data Stacks
Composable Data Stacks relate to table formats, as stacks like Lakehouse are built around open Table Formats.
Origin:
Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)
Created 2022-06-10