Data Lake Table Formats (Open Table Formats)

Last updated by Simon Späti

Prominent table formats include Delta Lake, Apache Iceberg, and Apache Hudi.

Data lake table formats provide database-like features on top of distributed File Formats. Similar to a traditional database table, these formats consolidate distributed files into a single table, simplifying management. Consider them an abstraction layer that structures your physical data files into coherent tables.
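
As a minimal sketch of that abstraction (using the open-source `deltalake` Python package; the path and file names are illustrative), writing a small table produces ordinary Parquet data files plus a `_delta_log` folder that turns them into one logical, versioned table:

```python
import os
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Write a small DataFrame as a Delta Lake table (Parquet files + transaction log).
df = pd.DataFrame({"id": [1, 2, 3], "city": ["Basel", "Bern", "Zurich"]})
write_deltalake("/tmp/cities", df)

# The table directory holds plain Parquet data files plus the _delta_log folder,
# which is what upgrades a pile of files into a queryable, versioned table.
print(sorted(os.listdir("/tmp/cities")))  # e.g. ['_delta_log', 'part-00001-....parquet']
print(DeltaTable("/tmp/cities").to_pandas())
```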


GitHub Star History

My Latest Complete Write Ups

If you'd like to read my latest blog posts on this topic, not just the notes, here are my latest write-ups:

# Tools

Table format tools:

# Specialized Formats

# AI focused

# Why Open Table Formats

Because of the Open Data Architecture.

# Market Updates

# General Features

  • DML and SQL Support: Inserts, Upserts, Deletes.
    • Provides merging, updating, and deleting directly on distributed files (see the merge sketch after this list).
    • Some formats also support Scala/Java and Python APIs in addition to SQL.
  • Backward compatible with Schema Evolution and Enforcement
    • Automatic Schema Evolution is a key feature in Table Formats, as changing schemas is still a challenging task in today’s data engineering work. Evolution here means we can add new columns, or even enlarge some types, without breaking anything. You can even rename or reorder columns, although that might break backward compatibility; still, we change one table and the Table Format takes care of updating it across all distributed files. Best of all, it does not require a rewrite of your table and underlying files.
  • ACID Transactions, Rollback, Concurrency Control (e.g., Delta has Optimistic Concurrency)
  • Time Travel, Audit History with Transaction Log (Delta Lake) and Rollback
    • Time travel allows the Table Format to version the big data stored in your data lake, enabling access to any historical version of that data. This simplifies data management, makes auditing easy, allows rolling back data in case of accidental bad writes or deletes, and helps reproduce experiments and reports (see the time-travel sketch after this list).
    • It enables reproducible queries against the exact same table snapshot and lets users easily examine changes, while version rollback allows quick correction of problems by resetting tables to a good state.
    • All formats assist with GDPR compliance.
    • The Transaction Log (Open Table Formats) is an ordered record of every transaction performed on a table since its inception. For example, Delta Lake creates a single folder called _delta_log (details in Delta Lake). This log is a common component across many of its features, including ACID transactions, scalable metadata handling, and time travel.
    • Scalable Metadata Handling: These table formats are not only equipped to handle large numbers of big files, but they also manage metadata at scale with automatic checkpointing and summarization.
  • Partitioning / Partitioning Evolution
    • These formats handle the tedious and error-prone task of producing partition values for rows in a table and automatically skip unnecessary partitions and files. No extra filters are needed for fast queries, and the table layout can be updated as data or queries change.
  • File Sizing, Data Clustering with Compaction
    • Data in Delta Lake can be compacted with OPTIMIZE - Delta Lake, and deletion of old versions can be managed by setting a retention period with VACUUM (see the compaction sketch after this list).
    • Data compaction is supported out-of-the-box, offering different rewrite strategies such as bin-packing or sorting to optimize file layout and size.
  • Unified Batch and Streaming Source and Sink (eliminating the need for Kappa Architecture)
    • Supports streaming ingestion, Built-in CDC sources & tools (Hudi).
    • It’s advantageous that it doesn’t matter whether you’re reading from a stream or a batch: Delta supports both with a single API and a single target sink. This is well explained in Beyond Lambda: Introducing Delta Architecture or through code examples (see the streaming sketch after this list). The often-used MERGE statement in SQL can be applied to your distributed files as well with Delta, including schema evolution and ACID transactions.
  • Data Sharing
    • For example, Delta Sharing: An open protocol for secure data sharing, making it simple to share data with other organizations regardless of the computing platforms they use (see the sharing sketch after this list).
  • Change Data Feed (CDF)
    • The CDF feature enables tables to track row-level changes between versions. When enabled, it records “change events” for all data written into the table, including row data and metadata indicating whether the row was inserted, deleted, or updated. Currently, this is supported mainly by Delta.
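
As a minimal sketch of DML on plain files, here is an upsert plus schema evolution with the open-source `deltalake` Python package (table paths are illustrative; option names vary slightly across releases):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Create a small Delta table to merge into.
write_deltalake("/tmp/users", pd.DataFrame({"id": [1, 2], "name": ["Ann", "Ben"]}))

# Upsert new rows directly on the distributed files -- no database server needed.
updates = pd.DataFrame({"id": [2, 3], "name": ["Bene", "Cara"]})
(
    DeltaTable("/tmp/users")
    .merge(updates, predicate="t.id = s.id", source_alias="s", target_alias="t")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)

# Schema evolution: appending a frame with an extra column evolves the table
# schema instead of failing (this option is named differently in older releases).
more = pd.DataFrame({"id": [4], "name": ["Dana"], "country": ["CH"]})
write_deltalake("/tmp/users", more, mode="append", schema_mode="merge")
```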
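
Time travel and the transaction log, continuing the same sketch: every commit above landed as an ordered entry in `_delta_log`, and any version can be re-opened:

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/users")

# The transaction log records every operation since the table's inception.
for commit in dt.history():
    print(commit.get("version"), commit.get("operation"))

# Time travel: open the table exactly as it looked at an earlier version,
# e.g. to audit, reproduce a report, or roll back a bad write.
dt_v0 = DeltaTable("/tmp/users", version=0)
print(dt_v0.to_pandas())
```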
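
Compaction and cleanup in the same hedged sketch (the 168-hour retention window is just an example value):

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/users")

# Bin-packing compaction rewrites many small files into fewer, larger ones;
# sorted layouts are available via dt.optimize.z_order([...]).
print(dt.optimize.compact())

# VACUUM-style cleanup: physically remove files no longer referenced by the
# log. dry_run=True only lists candidates; the retention window protects
# time travel to recent versions.
print(dt.vacuum(retention_hours=168, dry_run=True))
```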
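
For the unified batch/streaming claim, a PySpark sketch (assuming delta-spark is installed; all paths are placeholders) reads the same Delta table as a batch source and as a streaming source/sink:

```python
from pyspark.sql import SparkSession

# Spark session with the Delta Lake extensions enabled.
spark = (
    SparkSession.builder.appName("unified-batch-stream")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The same table works as a batch source...
batch_df = spark.read.format("delta").load("/tmp/events")

# ...and as a streaming source and sink, so no separate Kappa-style
# pipeline is needed for the speed layer.
query = (
    spark.readStream.format("delta").load("/tmp/events")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/_checkpoints/events")
    .start("/tmp/events_mirror")
)
```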
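
And for Delta Sharing, a consumer-side sketch with the `delta-sharing` Python client (the profile file and share coordinates are placeholders issued by the data provider):

```python
import delta_sharing

# A ".share" profile file holds the sharing endpoint and bearer token; the
# fragment after "#" names a share, schema, and table.
table_url = "config.share#my_share.my_schema.my_table"

# Read the shared table into pandas, regardless of the provider's platform.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```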

# Comparisons: Hudi, Iceberg, Delta

A detailed comparison of these formats is available in Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake).

Typically, Parquet’s binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats like Avro or ORC. Below is a chart that shows which file formats each table format supports for its data files.



# Apache Hudi vs Apache Iceberg vs Delta Lake

Apache Spark Summit 2020 talk: “A Thorough Comparison Of Delta Lake, Iceberg And Hudi”

# Iceberg compared to the rest

Full comparison list of open table formats:

| Feature | Apache Iceberg | Other Table Formats | Benefits |
| --- | --- | --- | --- |
| ACID Transactions | ✅ Full ACID transactions with optimistic concurrency | Delta Lake: Full ACID with optimistic concurrency<br>Hudi: Full ACID with optimistic or pessimistic concurrency<br>⚠️ Lance: Versioning support but not full ACID | Ensures data consistency across operations and prevents data corruption during concurrent writes |
| Schema Evolution | ✅ Full schema evolution (add, drop, rename, reorder, update types) | Delta Lake: Similar capabilities<br>Hudi: Similar capabilities<br>⚠️ Lance: Basic support | Allows tables to evolve without rewriting data or breaking compatibility |
| Time Travel | ✅ Supports time travel and version rollback | Delta Lake: Supports<br>Hudi: Supports<br>Lance: Zero-copy automatic versioning | Reproduces queries at specific points in time, enables rollback to previous states |
| Partitioning | ✅ Hidden partitioning with transformations (day, hour, bucket) and partition evolution | ⚠️ Delta Lake: Standard partitioning<br>⚠️ Hudi: Similar with clustering<br>⚠️ Lance: Limited partitioning | Automatic partition pruning without explicit filters, can evolve partition strategy without rewriting data |
| File Format Support | ✅ Supports Parquet, ORC, and Avro | ⚠️ Delta Lake: Primarily Parquet<br>Hudi: Parquet, Avro, ORC<br>⚠️ Lance: Custom columnar format | Flexibility in storage choices to match performance needs |
| Copy-on-Write vs Merge-on-Read | ✅ Supports both | Delta Lake: Supports both<br>Hudi: Supports both<br>⚠️ Lance: Not explicitly defined | Balance between write performance and read performance |
| Data Skipping | ✅ Column statistics in manifests for data skipping | Delta Lake: Column stats in checkpoint<br>Hudi: Column stats with HFile formats<br>Lance: Supports data skipping | Improves query performance by skipping irrelevant data files |
| Compaction | ✅ Data compaction with bin-packing, sorting, and Z-order | Delta Lake: OPTIMIZE with Z-order<br>Hudi: Managed compaction services<br>⚠️ Lance: Not explicitly defined | Optimizes file layout and size for better performance |
| Incremental Processing | ⚠️ Change queries for incremental data processing | ⚠️ Delta Lake: Change Data Feed in 2.0+<br>Hudi: Incremental queries from beginning<br>Lance: Not explicitly defined | Enables efficient processing of only data changes |
| Ecosystem Integration | ✅ Spark, Flink, Trino, Presto, Hive, DuckDB | ⚠️ Delta Lake: Strong Databricks integration<br>Hudi: Spark, Presto, Hive, Flink<br>⚠️ Lance: Arrow, Pandas, Polars, DuckDB | Determines compatibility with existing data tools |
| Governance & Community | ✅ Apache Software Foundation<br>Netflix, Tabular (acq. by Databricks) | Delta Lake: Linux Foundation, Databricks<br>Hudi: ASF, Uber<br>⚠️ Lance: Open source, LanceDB | Indicates project stability and development approach |
| Metadata Management | ✅ Avro manifest files with metadata table | ⚠️ Delta Lake: Parquet checkpoints<br>Hudi: MoR-based metadata table<br>⚠️ Lance: Different architecture | Impacts metadata performance and scaling |
| Concurrency Model | ✅ Table-level validation for conflict detection | Delta Lake: Optimistic concurrency<br>Hudi: Optimistic or pessimistic<br>Lance: Not explicitly defined | Determines how concurrent writers are handled |
| Real-time Updates | ⚠️ Not optimized for real-time | ⚠️ Delta Lake: Not optimized<br>Hudi: Streaming support<br>⚠️ Lance: Not optimized<br>Paimon: Optimized for real-time updates | Important for streaming use cases and low-latency requirements |
| Cloud Integration | ✅ Works with all major clouds | Delta Lake: Works with all clouds, optimized for Databricks<br>Hudi: Works with all clouds<br>Lance: Cloud-agnostic | Flexibility in deployment options |
| Deletion Support | ✅ Position and equality deletes | Delta Lake: Row-level deletes<br>Hudi: Row-level deletes<br>⚠️ Lance: Basic delete support | Affects GDPR compliance and data cleanup operations |
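
To make the hidden-partitioning row concrete, here is a minimal PyIceberg sketch, assuming an already-configured catalog (the catalog name and table identifier are placeholders). The day transform is declared once on the table; queries then filter on the raw timestamp and Iceberg prunes partitions without an explicit date column:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import NestedField, StringType, TimestampType

schema = Schema(
    NestedField(1, "event_ts", TimestampType(), required=False),
    NestedField(2, "payload", StringType(), required=False),
)

# Hidden partitioning: partition by day(event_ts); readers never see or
# filter on a separate partition column.
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000,
                   transform=DayTransform(), name="event_day")
)

catalog = load_catalog("default")  # assumes a catalog configured in .pyiceberg.yaml
catalog.create_table("analytics.events", schema=schema, partition_spec=spec)
```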

# Format Conversion

Exploring tools like Delta Universal Format (UniForm) and XTable can facilitate format transitions.
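
On the Delta side, UniForm is switched on as a table property. A hedged PySpark sketch (the table name is a placeholder; depending on the Delta release, additional properties such as delta.enableIcebergCompatV2 may be required):

```python
from pyspark.sql import SparkSession

# Spark session with Delta Lake enabled (delta-spark package).
spark = (
    SparkSession.builder.appName("uniform")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Ask Delta to additionally maintain Iceberg-readable metadata, so Iceberg
# clients can read the same files without copying data.
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```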

# History and Evolution

```mermaid
gantt
    title Managed Iceberg Ecosystem
    dateFormat YYYY
    axisFormat %Y

    section File Formats
    Apache Hadoop                :milestone, m1, 2006, 0d
    Apache Hive (Metastore)      :milestone, m2, 2008, 0d
    RCFile                       :milestone, m3, 2010, 0d
    Apache ORC                   :milestone, m4, 2013, 0d
    Apache Parquet               :milestone, m5, 2013, 0d

    section Table Formats
    Apache Hudi                  :milestone, m6, 2016, 0d
    Delta Lake                   :milestone, m7, 2017, 0d
    Apache Iceberg               :milestone, m8, 2017, 0d
    Apache Paimon                :milestone, m9, 2022, 0d
    Databricks acquires Tabular  :milestone, m10, 2024, 0d

    section Conversions
    Delta UniForm                :milestone, m11, 2023, 0d
    Apache XTable                :milestone, m12, 2023, 0d

    section Catalogs
    Glue Catalog                 :milestone, m13, 2021, 0d
    Unity Catalog                :milestone, m14, 2022, 0d
    Apache Polaris Catalog       :milestone, m15, 2024, 0d
    Snowflake Horizon Catalog    :milestone, m16, 2024, 0d

    section Managed Services
    AWS S3 Tables                :milestone, m17, 2024, 0d
    Cloudflare R2 Data Catalog   :milestone, m18, 2025, 0d
    Managed Iceberg by Databricks :milestone, m19, 2025, 0d
```

See more in Rill | The Open Table Format Revolution: Why Hyperscalers Are Betting on Managed Iceberg.

# Hive

Probably the first system offering SQL access on top of distributed storage was Apache Hive on Hadoop; see Apache Hive.

# Table Format Catalogs

Open Table Format Catalogs

Further insights are available in Use transactional processing and in my Data Lake/Lakehouse Guide, where I wrote about this in more detail.

# Composable Data Stacks

Composable Data Stacks relate to table formats, as stacks like Lakehouse are built around open Table Formats.

# Downsides and Limits of Open Table Formats

Some of the downsides:

  • More fragmented ecosystem: Multiple competing formats (Iceberg, Delta, Hudi) -> this should get consolidated around Iceberg/Delta
  • Low-level: These formats still require technical expertise and choosing/configuring a compute engine, compared to managed data warehouse solutions
  • Potential compaction needed: Regular maintenance operations like compaction, optimization, and file cleanup need to be done manually
  • Lower write concurrency: Not meant for many concurrent writers, compared to traditional databases
  • Added complexity: Requires implementing manual data governance, catalog management, and operational processes that are built-in with traditional data warehouses
  • Latency and performance trade-offs: Separation of compute and storage introduces network latency and hurts query performance. While cost-effective, they will never be as fast as tightly integrated, proprietary Modern OLAP Systems
  • Catalog layer lock-in: Despite open formats, the catalog/metastore layer still presents vendor lock-in challenges, as compatibility between different catalog solutions remains limited

See also Iceberg Specific Limitation 2025.

# Further Reads


Origin: Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)
Created 2022-06-10