Data Lake Table Formats
Data lake table formats are pivotal because they act as a database layer on top of a data lake. Like a traditional table, these formats consolidate many distributed files into a single logical table, simplifying management. Consider them an abstraction layer that structures your physical data files into coherent tables.
# Market Updates
- Snowflake introduced Iceberg tables at their 2022 summit. (See how Iceberg integrates with Snowflake, particularly its open-source aspects.)
- At the Databricks Data & AI Summit 2022, Databricks declared Delta Lake fully open-source, including all features in Delta Lake 2.0 (e.g., z-ordering, optimization). This solidifies its position as a leading format, further strengthened by the open-source sharing protocol Delta Sharing and by Delta Live Tables. They’re also developing an open-source marketplace (Databricks Marketplace).
# General Features
- DML and SQL Support: Inserts, Upserts, Deletes.
- Provides merging, updating, and deleting directly on distributed files.
- Some formats also support Scala/Java and Python APIs in addition to SQL.
- Backward-compatible Schema Evolution and Enforcement
- Automatic schema evolution is a key feature of table formats, since changing schemas remains one of the more challenging tasks in today’s data engineering work. Evolution here means you can add new columns, or even widen some types, without breaking anything. You can also rename or reorder columns, though that may break backward compatibility. Either way, you change the table once and the table format takes care of applying the change across all distributed files. Best of all, it does not require rewriting your table and its underlying files.
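Conceptually, schema evolution without rewrites works because readers project each file’s records onto the newest schema. A minimal sketch of that idea (purely illustrative, not any format’s real implementation):

```python
# Sketch of automatic schema evolution: readers project old records onto
# the newest schema, so adding a nullable column never rewrites old files.

def read_with_schema(record: dict, schema: list) -> dict:
    # Columns missing from an older file are filled with None (nullable).
    return {col: record.get(col) for col in schema}

# A row written before the schema change...
old_row = {"id": 1, "name": "Ada"}
# ...and the table schema after adding an "email" column.
evolved_schema = ["id", "name", "email"]

print(read_with_schema(old_row, evolved_schema))
# {'id': 1, 'name': 'Ada', 'email': None}
```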
- ACID Transactions, Rollback, Concurrency Control (e.g., Delta has Optimistic Concurrency)
- A transaction is designed to either commit all changes or rollback, ensuring you never end up in an inconsistent state.
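Optimistic concurrency, as Delta uses it, boils down to check-and-append: a writer’s commit succeeds only if no one else committed since it read the table; otherwise it retries. A toy sketch with illustrative names (not Delta’s actual classes):

```python
# Toy sketch of optimistic concurrency control: each commit claims the next
# version number; if another writer got there first, the commit is rejected
# and must be retried against the new state.

class CommitConflict(Exception):
    pass

class Table:
    def __init__(self):
        self.versions = []  # ordered list of committed transactions

    def commit(self, actions, read_version):
        # Atomic check-and-append: succeed only if nothing was committed
        # since this writer read the table.
        if read_version != len(self.versions):
            raise CommitConflict("table changed since read; retry")
        self.versions.append(actions)
        return len(self.versions)  # new version number

table = Table()
v = table.commit(["add file-1"], read_version=0)      # succeeds -> version 1
try:
    table.commit(["add file-2"], read_version=0)      # stale read -> conflict
except CommitConflict:
    v = table.commit(["add file-2"], read_version=v)  # retry -> version 2
```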
- Integrated with various cluster-computing frameworks (e.g., Iceberg with Apache Spark, Trino, Flink, Presto, Apache Hive, Impala; Hudi with Apache Spark, Presto, Trino, Hive; and Delta with Spark, Presto, Trino, Athena).
- Time Travel, Audit History with Transaction Log (Delta Lake) and Rollback
- Time travel allows the Table Format to version the big data stored in your data lake, enabling access to any historical version of that data. This simplifies data management, makes auditing easy, allows rolling back data in case of accidental bad writes or deletes, and helps reproduce experiments and reports.
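The mechanics of time travel can be sketched in a few lines: every write produces a new addressable version, and rollback is just re-committing an old one. This is a conceptual illustration only; real formats reconstruct versions from a transaction log rather than storing full snapshots.

```python
# Minimal sketch of time travel: every write creates a new immutable
# version, and readers can ask for any historical one.

history = []  # one snapshot per version, oldest first

def write(rows):
    history.append(list(rows))  # version n = nth snapshot

def read(version=None):
    # Default to the latest version; older versions stay addressable,
    # which is what makes audits and rollbacks cheap.
    return history[-1 if version is None else version]

write([{"id": 1, "amount": 10}])                            # version 0
write([{"id": 1, "amount": 10}, {"id": 2, "amount": 99}])   # version 1: bad write
write(read(version=0))                                      # "rollback": re-commit v0

print(read())           # [{'id': 1, 'amount': 10}]
print(read(version=1))  # the bad write remains auditable
```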
- All formats assist with GDPR compliance.
- The Transaction Log is an ordered record of every transaction performed on a table since its inception. For example, Delta Lake creates a single folder called `_delta_log` (details in Delta Lake). This log is a common component across many of its features, including ACID transactions, scalable metadata handling, and time travel.
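Replaying that ordered log is how the current table state is derived. A sketch of the idea, with made-up file and action names loosely modeled on Delta’s add/remove actions:

```python
# Sketch of deriving table state from a Delta-style transaction log:
# each commit is a JSON file of "add"/"remove" file actions, and replaying
# them in order yields the live set of data files.
import json

log = [  # e.g. _delta_log/00000.json, _delta_log/00001.json (illustrative)
    json.dumps([{"add": "part-000.parquet"}, {"add": "part-001.parquet"}]),
    json.dumps([{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}]),
]

live_files = set()
for commit in log:  # replay every transaction in order
    for action in json.loads(commit):
        if "add" in action:
            live_files.add(action["add"])
        elif "remove" in action:
            live_files.discard(action["remove"])

print(sorted(live_files))  # ['part-001.parquet', 'part-002.parquet']
```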
- Scalable Metadata Handling: These table formats are not only equipped to handle large volumes of data and big files, but they also manage metadata at scale with automatic checkpointing and summarization.
- Time-travel: Enables reproducible queries using the exact same table snapshot, or lets users easily examine changes. Version rollback allows for quick correction of problems by resetting tables to a good state.
- Partitioning and Partition Evolution
- These formats handle the tedious and error-prone task of producing partition values for rows in a table and automatically skip unnecessary partitions and files. No extra filters are needed for fast queries, and the table layout can be updated as data or queries change.
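The point about automatically producing partition values and skipping files can be sketched as follows. The partition scheme (month derived from a timestamp) and file names are hypothetical, chosen only to illustrate pruning:

```python
# Toy sketch of hidden partitioning: the format derives partition values
# (here, a month) from row data, and queries skip files whose partition
# cannot match the filter -- no extra filter columns needed from the user.

def partition_value(row):
    return row["event_ts"][:7]  # derive "YYYY-MM" from the timestamp

files = {  # partition value -> data files (illustrative layout)
    "2023-01": ["f1.parquet"],
    "2023-02": ["f2.parquet", "f3.parquet"],
}

def files_for_query(month):
    # Partition pruning: only files in matching partitions are read.
    return files.get(month, [])

row = {"event_ts": "2023-02-17T09:30:00"}
assert partition_value(row) == "2023-02"
print(files_for_query("2023-02"))  # ['f2.parquet', 'f3.parquet']
```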
- File Sizing, Data Clustering with Compaction
- Unified Batch and Streaming Source and Sink (eliminating the need for Lambda Architecture)
- Supports streaming ingestion and built-in CDC sources & tools (Hudi).
- It’s advantageous that it doesn’t matter whether you’re reading from a stream or a batch: Delta supports a single API and a single target sink for both. This is well explained in Beyond Lambda: Introducing Delta Architecture or through code examples. The often-used SQL MERGE statement can be applied to your distributed files as well with Delta, including schema evolution and ACID transactions.
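The upsert semantics behind MERGE can be sketched without a cluster: matched rows are updated, unmatched source rows are inserted. A minimal illustration of the logic (not Delta’s implementation), which is the same whether the source arrived by batch or by stream:

```python
# Sketch of MERGE semantics on a keyed table: matched target rows are
# updated from the source, and unmatched source rows are inserted.

def merge(target, source, key="id"):
    merged = {row[key]: row for row in target}
    for row in source:
        # Upsert: overwrite matching fields, insert if the key is new.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())

target = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]
source = [{"id": 2, "name": "Alan T."}, {"id": 3, "name": "Grace"}]

print(merge(target, source))
# [{'id': 1, 'name': 'Ada'}, {'id': 2, 'name': 'Alan T.'}, {'id': 3, 'name': 'Grace'}]
```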
- Data Sharing
- For example, Delta Sharing: An open protocol for secure data sharing, making it simple to share data with other organizations regardless of the computing platforms they use.
- Change Data Feed (CDF)
- The CDF feature enables tables to track row-level changes between versions. When enabled, it records “change events” for all data written into the table, including row data and metadata indicating whether the row was inserted, deleted, or updated. Currently, this is supported mainly by Delta.
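The shape of those change events can be sketched as rows tagged with a change type and commit version, which downstream consumers read incrementally. Field names here loosely mimic Delta’s CDF columns but are illustrative only:

```python
# Toy sketch of a change data feed: each write also emits row-level change
# events tagged with a change type and the commit version.

changes = []

def apply_write(version, inserts=(), updates=(), deletes=()):
    for row in inserts:
        changes.append({**row, "_change_type": "insert", "_commit_version": version})
    for row in updates:
        changes.append({**row, "_change_type": "update_postimage", "_commit_version": version})
    for row in deletes:
        changes.append({**row, "_change_type": "delete", "_commit_version": version})

apply_write(1, inserts=[{"id": 1, "qty": 5}])
apply_write(2, updates=[{"id": 1, "qty": 7}])
apply_write(3, deletes=[{"id": 1, "qty": 7}])

# Consumers can ask for all changes since a given version:
since_v2 = [c for c in changes if c["_commit_version"] >= 2]
print([c["_change_type"] for c in since_v2])  # ['update_postimage', 'delete']
```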
# Comparisons: Hudi, Iceberg, Delta
A detailed comparison of these formats is available in Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake).
Typically, Parquet’s binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats such as Avro or ORC. Below is a chart that shows which file formats are allowed to make up the data files of a table.
# Format Conversion
# Additional Resources