Data Lake Table Formats (Open Table Formats)

Last updated: May 13, 2026 by Simon Späti · Created: Jun 10, 2022

Prominent table formats include Delta Lake, Apache Iceberg, and Apache Hudi.

Data lake table formats provide database-like features on top of distributed File Formats. Similar to a traditional table, these formats consolidate distributed files into a single table, simplifying management. Consider them an abstraction layer that structures your physical data files into coherent tables.

GitHub Star History | Also check the DuckDB extension downloads comparison: Iceberg vs. Delta vs. DuckLake

My Latest Complete Write Ups

If you'd like to read my full blog posts on this topic, not just the notes, here are my latest write-ups:

2025: The Open Table Format Revolution: Why Hyperscalers Are Betting on Managed Iceberg

2025: Open Table Formats and Catalogs

2022: Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)

# Tools

Table format tools:

# Specialized Formats

Apache Paimon: Real-time focused
DuckLake (a combination of OTF and Open Catalogs)
Havasu (Spatial, on top of Iceberg)
IndexTables (Spark): An experimental open-table format for Apache Spark that enables fast retrieval and full-text search across large-scale data. It integrates seamlessly with Spark SQL, allowing you to combine powerful search capabilities with joins, aggregations, and standard SQL operations.

# AI focused

Bauplan (closed-source): More a Lakehouse solution like Dremio, etc.
AgentFS: The filesystem for agents.

# Why Open Table Formats

Because of the Open Data Architecture.

# Updates and Trends

# Trends

My guess is that Delta and Iceberg will both be used for the long term. And there’s a new format almost every week (Lance, Paimon, Havasu, IndexTables, Bauplan, AgentFS, etc.).

So it’s hard to say. But I would guess that interfaces are becoming interchangeable, like the S3 API, which is implemented by many other object stores (see MinIO, S3 Storage Alternatives).

# Market Updates

Databricks Data & AI Summit 2024: Databricks announced the acquisition of Tabular, the company founded by the creators of Apache Iceberg, as well as the open-sourcing of Unity Catalog.
Snowflake announced Polaris Catalog to integrate additional REST APIs to Iceberg and Snowflake.
XTable (formerly OneTable): An omnidirectional converter between table formats, similar to Delta Universal Format (UniForm) from Databricks.
Snowflake introduced Iceberg tables at their 2022 summit. (Delve into how Iceberg synergizes with Snowflake, particularly its open-source aspects.)
At the Data & AI Summit 2022, Databricks declared Delta Lake fully open source, including all features in Delta Lake 2.0 (e.g., Z-ordering, optimization). This solidifies its position as a leading format, further enhanced by the open-source sharing protocol Delta Sharing and by Delta Live Tables. They’re also developing an open-source marketplace (Market Place - Databricks).
- Comment on Reddit.

# General Features

DML and SQL Support: Inserts, Upserts, Deletes.
- Provides merging, updating, and deleting directly on distributed files.
- Some formats also support Scala/Java and Python APIs in addition to SQL.
Backward compatible with Schema Evolution and Enforcement
- Automatic Schema Evolution is a key feature in Table Formats, as changing formats is still a challenging task in today’s data engineering work. Evolution here means we can add new columns without breaking anything, or even widen some types. You can even rename or reorder columns, although that might break backward compatibility. We change one table, and the Table Format takes care of applying the change across all distributed files. Best of all, it does not require a rewrite of your table and underlying files.
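To make the "no rewrite" point concrete, here is a minimal Python sketch of ID-based column tracking, loosely modeled on how Iceberg identifies columns by field ID rather than by name. Everything in it (`Column`, `TableSchema`, `read_file`, the `SAFE_WIDENING` set) is a hypothetical toy and not any format's real API. The point is only that schema changes touch metadata, while old files are projected onto the current schema at read time.

```python
from dataclasses import dataclass

# Assumed subset of type promotions that need no file rewrite.
SAFE_WIDENING = {("int", "long"), ("float", "double")}


@dataclass
class Column:
    field_id: int  # stable identity; survives renames
    name: str
    dtype: str


class TableSchema:
    """Toy table metadata: schema changes never touch data files."""

    def __init__(self, columns):
        self.columns = list(columns)
        self._next_id = max(c.field_id for c in self.columns) + 1

    def _find(self, name):
        for col in self.columns:
            if col.name == name:
                return col
        raise KeyError(name)

    def add_column(self, name, dtype):
        self.columns.append(Column(self._next_id, name, dtype))
        self._next_id += 1

    def rename_column(self, old, new):
        self._find(old).name = new

    def widen(self, name, new_dtype):
        col = self._find(name)
        if (col.dtype, new_dtype) not in SAFE_WIDENING:
            raise ValueError(f"unsafe type change: {col.dtype} -> {new_dtype}")
        col.dtype = new_dtype


def read_file(rows, schema):
    """Project rows stored by field ID onto the current schema.

    Columns added after the file was written come back as None.
    """
    return [{c.name: row.get(c.field_id) for c in schema.columns} for row in rows]


schema = TableSchema([Column(1, "id", "int"), Column(2, "amount", "float")])
old_file = [{1: 7, 2: 9.5}]  # written before any schema change

schema.add_column("country", "string")
schema.rename_column("amount", "total")
schema.widen("id", "long")

result = read_file(old_file, schema)
print(result)
```

The old file is never rewritten: the rename resolves through the field ID, and the new `country` column simply reads as `None` for rows written earlier.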
ACID Transactions, Rollback, Concurrency Control (e.g., Delta has Optimistic Concurrency)
- A transaction is designed to either commit all changes or rollback, ensuring you never end up in an inconsistent state.
- Integrated with various cluster-computing frameworks (e.g., Iceberg with Apache Spark, Trino, Flink, Presto, Apache Hive, Impala; Hudi with Apache Spark, Presto, Trino, Hive; and Delta with Spark, Presto, Trino, Athena).
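As a rough illustration of optimistic concurrency, the following Python sketch lets two writers read the same table version and then race to commit. `ToyTable`, `CommitConflict`, and `write_with_retry` are made-up names for this note. Real formats run finer-grained conflict checks (for example, whether the two writers touched overlapping files) instead of rejecting every stale commit.

```python
class CommitConflict(Exception):
    """Raised when the table moved on since the writer read it."""


class ToyTable:
    def __init__(self):
        self.log = []  # one entry per committed transaction

    @property
    def version(self):
        return len(self.log) - 1  # -1 means "empty table"

    def commit(self, read_version, added_files):
        # All-or-nothing: either the whole transaction lands, or nothing does.
        if read_version != self.version:
            raise CommitConflict(
                f"read version {read_version}, table is at {self.version}"
            )
        self.log.append(tuple(added_files))
        return self.version


def write_with_retry(table, files, max_attempts=3):
    """Re-read the latest version and retry when another writer won the race."""
    for _ in range(max_attempts):
        snapshot = table.version
        try:
            return table.commit(snapshot, files)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


table = ToyTable()
a_read = table.version  # writer A reads the table
b_read = table.version  # writer B reads the same version

table.commit(a_read, ["a.parquet"])  # A commits first

try:
    table.commit(b_read, ["b.parquet"])  # B's view is now stale
    conflict = False
except CommitConflict:
    conflict = True

final_version = write_with_retry(table, ["b.parquet"])
print(conflict, final_version, table.log)
```

No locks are held while writers prepare their data; the check happens only at commit time, which is why this model works well when conflicts are rare.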
Time Travel, Audit History with Transaction Log (Delta Lake) and Rollback
- Time travel allows the Table Format to version the big data stored in your data lake, enabling access to any historical version of that data. This simplifies data management, makes auditing easy, allows rolling back data in case of accidental bad writes or deletes, and helps reproduce experiments and reports.
- All formats assist with GDPR compliance.
- The Transaction Log (Open Table Formats) is an ordered record of every transaction performed on a table since its inception. For example, Delta Lake creates a single folder called _delta_log (details in Delta Lake). This log is a common component across many of its features, including ACID transactions, scalable metadata handling, and time travel.
- Scalable Metadata Handling: These table formats are not only equipped to handle a large number of big files, but they also manage metadata at scale with automatic checkpointing and summarization.
- ℹ️ Helpful for GDPR compliance.
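The transaction log idea can be sketched in a few lines of Python. This is a simplified model under my own assumptions: each log entry only lists added and removed files (real logs such as `_delta_log` carry much more, e.g. schema and statistics), and `snapshot` and `rollback` are hypothetical helper names. Time travel is then just replaying the log up to a given version.

```python
def snapshot(log, version):
    """Replay the log up to `version` and return the set of live data files."""
    files = set()
    for entry in log[: version + 1]:
        files |= set(entry.get("add", []))
        files -= set(entry.get("remove", []))
    return files


def rollback(log, version):
    """Restore an old state by appending a new entry; history is never edited."""
    current = snapshot(log, len(log) - 1)
    target = snapshot(log, version)
    log.append(
        {"add": sorted(target - current), "remove": sorted(current - target)}
    )


log = [
    {"add": ["part-0.parquet"]},                                # version 0
    {"add": ["part-1.parquet"]},                                # version 1
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},  # version 2
]

v0 = snapshot(log, 0)  # time travel to the first version
v2 = snapshot(log, 2)  # latest state before the rollback

rollback(log, 0)       # creates version 3 with the same files as version 0
v3 = snapshot(log, 3)
print(v0, v2, v3)
```

Because a rollback is itself a new log entry, the audit history stays intact and version 2 remains queryable afterwards.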
Time-travel: Enables reproducible queries using the exact same table snapshot, or lets users easily examine changes. Version rollback allows for quick correction of problems by resetting tables to a good state.
- Avoids the need to implement complex Slowly Changing Dimension (Type 2), and with all transactions being recorded in the transaction log and having each version handy, you can extract changes as you would with Change Data Capture (CDC) (if done often enough to not lose intermediate changes).
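Extracting changes between two versions can be pictured as a keyed diff of two snapshots. The sketch below is a toy under stated assumptions: rows are plain dicts with a unique `id`, the function name `diff_versions` is hypothetical, and the change labels are borrowed from Delta's Change Data Feed naming for illustration only.

```python
def diff_versions(before, after, key):
    """Compare two table snapshots and emit CDC-style change events."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    events = []
    for k in sorted(old.keys() | new.keys()):
        if k not in old:
            events.append({**new[k], "_change_type": "insert"})
        elif k not in new:
            events.append({**old[k], "_change_type": "delete"})
        elif old[k] != new[k]:
            events.append({**old[k], "_change_type": "update_preimage"})
            events.append({**new[k], "_change_type": "update_postimage"})
    return events


version_1 = [{"id": 1, "qty": 5}, {"id": 2, "qty": 3}]
version_2 = [{"id": 1, "qty": 8}, {"id": 3, "qty": 1}]

events = diff_versions(version_1, version_2, key="id")
for event in events:
    print(event)
```

A diff like this only sees the net effect between the two versions it compares, which is why intermediate changes are lost if snapshots are compared too rarely.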
Partitioning / Partitioning Evolution
- These formats handle the tedious and error-prone task of producing partition values for rows in a table and automatically skip unnecessary partitions and files. No extra filters are needed for fast queries, and the table layout can be updated as data or queries change.
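Here is a small Python sketch of how partition pruning without extra filters might work, assuming a `day` partition transform on a timestamp column. The file list, `day_transform`, and `plan_scan` are invented for this example; a real table format keeps this information in its metadata and does the planning inside the engine.

```python
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Derive the partition value from the row's timestamp."""
    return ts.date()


# Table metadata: each data file records the partition value it belongs to.
files = [
    {"path": "f1.parquet", "partition": date(2024, 1, 1)},
    {"path": "f2.parquet", "partition": date(2024, 1, 2)},
    {"path": "f3.parquet", "partition": date(2024, 1, 3)},
]


def plan_scan(files, ts_from, ts_to):
    """Turn a filter on the timestamp column into a partition filter."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [f["path"] for f in files if lo <= f["partition"] <= hi]


# The query only filters on the timestamp; no partition column is mentioned.
selected = plan_scan(
    files,
    ts_from=datetime(2024, 1, 2, 8, 0),
    ts_to=datetime(2024, 1, 2, 18, 0),
)
print(selected)
```

The user never writes the partition value or filters on it; both are derived from the timestamp, so only `f2.parquet` is read.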
File Sizing, Data Clustering with Compaction
- In Delta Lake, data can be compacted with OPTIMIZE, and the deletion of old versions can be managed by setting a retention period with VACUUM.
- Data compaction is supported out-of-the-box, offering different rewrite strategies such as bin-packing or sorting to optimize file layout and size.
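The bin-packing rewrite strategy can be sketched as a greedy first-fit over file sizes. This is a toy planner with made-up file names and a hypothetical `bin_pack` helper. It only decides which small files would be rewritten together; real compaction also rewrites the data and commits the result atomically.

```python
def bin_pack(file_sizes_mb, target_mb):
    """Group small files into bins no larger than target_mb.

    First-fit decreasing: place the largest files first, each into the
    first bin that still has room.
    """
    bins = []
    largest_first = sorted(file_sizes_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in largest_first:
        for b in bins:
            if b["size"] + size <= target_mb:
                b["files"].append(name)
                b["size"] += size
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append({"files": [name], "size": size})
    return bins


small_files = {"a": 40, "b": 70, "c": 30, "d": 60, "e": 90}
plan = bin_pack(small_files, target_mb=128)

for b in plan:
    print(b["files"], b["size"])
```

In this example, five small files end up in three rewrite groups, each at or below the 128 MB target.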
Unified Batch and Streaming Source and Sink (eliminating the need for Kappa Architecture)
- Supports streaming ingestion, Built-in CDC sources & tools (Hudi).
- It’s advantageous that it doesn’t matter if you’re reading from a stream or batch. Delta supports both through a single API and a single target sink. This is well explained in Beyond Lambda: Introducing Delta Architecture or through code examples. The often-used MERGE statement in SQL can be applied to your distributed files as well with Delta, including schema evolution and ACID transactions.
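For readers less familiar with MERGE, its upsert semantics can be mimicked in plain Python. This in-memory sketch is not Delta's API and ignores files, transactions, and schema evolution. The `merge_upsert` name and the sample rows are assumptions, and it corresponds roughly to "when matched then update, when not matched then insert".

```python
def merge_upsert(target, source, key):
    """Apply source rows to target rows, matching on `key`."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)  # WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)  # WHEN NOT MATCHED THEN INSERT
    return [merged[k] for k in sorted(merged)]


target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
source = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]  # batch or micro-batch

merged_rows = merge_upsert(target, source, key="id")
print(merged_rows)
```

The same function works whether `source` comes from a nightly batch or a streaming micro-batch, which is the point of a unified sink.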
Data Sharing
- For example, Delta Sharing: An open protocol for secure data sharing, making it simple to share data with other organizations regardless of the computing platforms they use.
Change Data Feed (CDF)
- The CDF feature enables tables to track row-level changes between versions. When enabled, it records “change events” for all data written into the table, including row data and metadata indicating whether the row was inserted, deleted, or updated. Currently, this is supported mainly by Delta.

# Comparisons: Hudi, Iceberg, Delta

A detailed comparison of these formats is available in Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake).

Typically, Parquet’s binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats like Avro or ORC. Below is a chart that shows which file formats each table format allows for the data files of a table.

# Apache Hudi vs Apache Iceberg vs Delta Lake

Apache Spark Summit 2020: “A Thorough Comparison of Delta Lake, Iceberg and Hudi”

# Iceberg compared to the rest

Full comparison of open table formats:

| Feature | Apache Iceberg | Other Table Formats | Benefits |
| --- | --- | --- | --- |
| ACID Transactions | ✅ Full ACID transactions with optimistic concurrency | ✅ Delta Lake: Full ACID with optimistic concurrency<br>✅ Hudi: Full ACID with optimistic or pessimistic concurrency<br>⚠️ Lance: Versioning support but not full ACID | Ensures data consistency across operations and prevents data corruption during concurrent writes |
| Schema Evolution | ✅ Full schema evolution (add, drop, rename, reorder, update types) | ✅ Delta Lake: Similar capabilities<br>✅ Hudi: Similar capabilities<br>⚠️ Lance: Basic support | Allows tables to evolve without rewriting data or breaking compatibility |
| Time Travel | ✅ Supports time travel and version rollback | ✅ Delta Lake: Supports<br>✅ Hudi: Supports<br>✅ Lance: Zero-copy automatic versioning | Reproduces queries at specific points in time, enables rollback to previous states |
| Partitioning | ✅ Hidden partitioning with transformations (day, hour, bucket) and partition evolution | ⚠️ Delta Lake: Standard partitioning<br>⚠️ Hudi: Similar with clustering<br>⚠️ Lance: Limited partitioning | Automatic partition pruning without explicit filters, can evolve partition strategy without rewriting data |
| File Format Support | ✅ Supports Parquet, ORC, and Avro | ⚠️ Delta Lake: Primarily Parquet<br>✅ Hudi: Parquet, Avro, ORC<br>⚠️ Lance: Custom columnar format | Flexibility in storage choices to match performance needs |
| Copy-on-Write vs Merge-on-Read | ✅ Supports both | ✅ Delta Lake: Supports both<br>✅ Hudi: Supports both<br>⚠️ Lance: Not explicitly defined | Balance between write performance and read performance |
| Data Skipping | ✅ Column statistics in manifests for data skipping | ✅ Delta Lake: Column stats in checkpoint<br>✅ Hudi: Column stats with HFile formats<br>✅ Lance: Supports data skipping | Improves query performance by skipping irrelevant data files |
| Compaction | ✅ Data compaction with bin-packing, sorting, and Z-order | ✅ Delta Lake: OPTIMIZE with Z-order<br>✅ Hudi: Managed compaction services<br>⚠️ Lance: Not explicitly defined | Optimizes file layout and size for better performance |
| Incremental Processing | ⚠️ Change queries for incremental data processing | ⚠️ Delta Lake: Change Data Feed in 2.0+<br>✅ Hudi: Incremental queries from beginning<br>❌ Lance: Not explicitly defined | Enables efficient processing of only data changes |
| Ecosystem Integration | ✅ Spark, Flink, Trino, Presto, Hive, DuckDB | ⚠️ Delta Lake: Strong Databricks integration<br>✅ Hudi: Spark, Presto, Hive, Flink<br>⚠️ Lance: Arrow, Pandas, Polars, DuckDB | Determines compatibility with existing data tools |
| Governance & Community | ✅ Apache Software Foundation; Netflix, Tabular (acq. by Databricks) | ✅ Delta Lake: Linux Foundation, Databricks<br>✅ Hudi: ASF, Uber<br>⚠️ Lance: Open source, LanceDB | Indicates project stability and development approach |
| Metadata Management | ✅ Avro manifest files with metadata table | ⚠️ Delta Lake: Parquet checkpoints<br>✅ Hudi: MoR-based metadata table<br>⚠️ Lance: Different architecture | Impacts metadata performance and scaling |
| Concurrency Model | ✅ Table-level validation for conflict detection | ✅ Delta Lake: Optimistic concurrency<br>✅ Hudi: Optimistic or pessimistic<br>❌ Lance: Not explicitly defined | Determines how concurrent writers are handled |
| Real-time Updates | ⚠️ Not optimized for real-time | ⚠️ Delta Lake: Not optimized<br>✅ Hudi: Streaming support<br>⚠️ Lance: Not optimized<br>✅ Paimon: Optimized for real-time updates | Important for streaming use cases and low-latency requirements |
| Cloud Integration | ✅ Works with all major clouds | ✅ Delta Lake: Works with all clouds, optimized for Databricks<br>✅ Hudi: Works with all clouds<br>✅ Lance: Cloud-agnostic | Flexibility in deployment options |
| Deletion Support | ✅ Position and equality deletes | ✅ Delta Lake: Row-level deletes<br>✅ Hudi: Row-level deletes<br>⚠️ Lance: Basic delete support | Affects GDPR compliance and data cleanup operations |
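The data skipping row is easiest to understand with a tiny example. The sketch below assumes per-file min/max statistics for one column, stored in a made-up `manifest` list. Real formats keep such statistics in manifests or checkpoints, and `files_to_scan` is a hypothetical planner function.

```python
# Per-file column statistics as (min, max), as a table format might track them.
manifest = [
    {"path": "f1.parquet", "stats": {"order_id": (1, 1000)}},
    {"path": "f2.parquet", "stats": {"order_id": (1001, 2000)}},
    {"path": "f3.parquet", "stats": {"order_id": (2001, 3000)}},
]


def files_to_scan(manifest, column, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    selected = []
    for entry in manifest:
        col_min, col_max = entry["stats"][column]
        if col_max >= lo and col_min <= hi:
            selected.append(entry["path"])
    return selected


# WHERE order_id BETWEEN 1500 AND 2100
to_scan = files_to_scan(manifest, "order_id", 1500, 2100)
print(to_scan)
```

The first file is skipped without being opened, because its maximum `order_id` is below the queried range.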

# Format Conversion

Tools like Delta Universal Format (UniForm) and XTable can facilitate transitions between formats.

# History and Evolution

gantt
    title Managed Iceberg Ecosystem
    dateFormat YYYY
    axisFormat %Y
    
    section File Formats
    Apache Hadoop :milestone, m1, 2006, 0d
    Apache Hive (Metastore)                  :milestone, m2, 2008, 0d
    RCFile                       :milestone, m3, 2010, 0d
    Apache ORC                   :milestone, m4, 2013, 0d
    Apache Parquet               :milestone, m5, 2013, 0d
    
    section Table Formats
    Apache Hudi                  :milestone, m6, 2016, 0d
    Delta Lake                   :milestone, m7, 2017, 0d
    Apache Iceberg               :milestone, m8, 2017, 0d
    Apache Paimon                :milestone, m9, 2022, 0d
    Databricks acquires Tabular  :milestone, m17, 2024, 0d
    
    section Conversions
    Delta UniForm                :milestone, m10, 2023, 0d
    Apache XTable                :milestone, m11, 2023, 0d
    
    section Catalogs
    Glue Catalog                 :milestone, m13, 2021, 0d
    Unity Catalog                :milestone, m12, 2022, 0d
    Apache Polaris Catalog       :milestone, m16, 2024, 0d
    Snowflake Horizon Catalog    :milestone, m14, 2024, 0d
    
    section Managed Services
    AWS S3 Tables                :milestone, m18, 2024, 0d
    Cloudflare R2 Data Catalog   :milestone, m19, 2025, 0d
    Managed Iceberg by Databricks :milestone, m15, 2025, 0d

See more on Rill | The Open Table Format Revolution: Why Hyperscalers Are Betting on Managed Iceberg

More fragmented ecosystem: Multiple competing formats (Iceberg, Delta, Hudi); this should consolidate around Iceberg/Delta
Low-level: These formats still require technical expertise and choosing/configuring a compute engine, compared to managed data warehouse solutions
Potential compaction needed: Regular maintenance operations like compaction, optimization, and file cleanup need to be done manually
Lower write concurrency: Not meant for multiple concurrent writers, compared to traditional databases
Added complexity: Requires implementing manual data governance, catalog management, and operational processes that are built in with traditional data warehouses
Latency and performance trade-offs: Separation of compute and storage introduces network latency and affects query performance. While cost-effective, these formats will never be as fast as tightly integrated, proprietary Modern OLAP Systems
Catalog layer lock-in: Despite open formats, the catalog/metastore layer still presents vendor lock-in challenges, as compatibility between different catalog solutions remains limited

Also more on Iceberg Specific Limitation 2025.

# Further Reads

Onehouse’s Feature Comparison
- Update log
  - 8/11/22 - Original publish date
  - 1/11/23 - Refresh feature comparisons, added community stats + benchmarks
  - 1/12/23 - Databricks contributed few minor corrections
  - 10/31/23 - Minor edits
  - 1/31/24 - Minor update about current state of OneTable
  - 10/02/25 - Refreshed comparisons with latest releases
Dremio’s Comparison
LakeFS’s Overview
The ultimate guide to table format internals - all my writing so far — Jack Vanlightly

Origin: Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)