🧠 Second Brain


Data Lake Table Formats (Open Data Formats)

Last updated Jul 9, 2024

Prominent table formats include Delta Lake, Apache Iceberg, and Apache Hudi.

Data lake table formats provide database-like features on top of distributed File Formats. Like a traditional database table, a table format consolidates many distributed files into a single logical table, which simplifies management. Think of it as an abstraction layer that organizes your physical data files into coherent tables.
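To make the "abstraction layer" idea concrete, here is a minimal sketch of a toy table format in Python. It assumes one very simplified design: an append-only commit log of JSON entries that add or remove data files, replayed to find out which files make up the table at a given version. The class `ToyTable`, the `_log` directory, and the `.jsonl` data files are all hypothetical names chosen for illustration. This is not how Delta Lake, Iceberg, or Hudi actually lay out their metadata; it only shows the general principle.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Optional, Sequence


class ToyTable:
    """Toy table format: plain data files plus an append-only commit log.

    Hypothetical illustration only, not the layout of any real table format.
    """

    def __init__(self, root: Path) -> None:
        self.root = root
        self.log_dir = root / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _commits(self) -> list[Path]:
        # Zero-padded file names make lexical order equal commit order.
        return sorted(self.log_dir.glob("*.json"))

    def write_rows(self, name: str, rows: Sequence[dict]) -> str:
        """Write a data file. It is invisible to readers until committed."""
        (self.root / name).write_text("\n".join(json.dumps(r) for r in rows))
        return name

    def commit(self, add: Sequence[str], remove: Sequence[str] = ()) -> int:
        """Record which files join or leave the table; returns the new version."""
        version = len(self._commits())
        entry = {"add": list(add), "remove": list(remove)}
        (self.log_dir / f"{version:08d}.json").write_text(json.dumps(entry))
        return version

    def snapshot(self, version: Optional[int] = None) -> list[str]:
        """Replay the log to find the files that form the table at a version."""
        commits = self._commits()
        if version is not None:
            commits = commits[: version + 1]
        live: set[str] = set()
        for path in commits:
            entry = json.loads(path.read_text())
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return sorted(live)

    def read(self, version: Optional[int] = None) -> list[dict]:
        """Read the table as one list of rows, hiding the underlying files."""
        rows: list[dict] = []
        for name in self.snapshot(version):
            text = (self.root / name).read_text()
            rows.extend(json.loads(line) for line in text.splitlines() if line)
        return rows


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        table = ToyTable(Path(tmp))
        f1 = table.write_rows("part-0.jsonl", [{"id": 1}, {"id": 2}])
        v0 = table.commit(add=[f1])
        f2 = table.write_rows("part-1.jsonl", [{"id": 3}])
        table.commit(add=[f2])
        # Compaction: rewrite both files as one and swap them in one commit.
        f3 = table.write_rows("part-2.jsonl", table.read())
        table.commit(add=[f3], remove=[f1, f2])
        return {
            "latest_files": table.snapshot(),
            "latest_rows": table.read(),
            "files_at_v0": table.snapshot(v0),
            "rows_at_v0": table.read(v0),
        }


if __name__ == "__main__":
    print(demo())
```

Readers only ever call `read()`, so they see one table regardless of how many files sit underneath. Because old commits stay in the log, reading an earlier version (a simple form of time travel) comes almost for free, and a compaction becomes visible as a single commit rather than as a series of file changes.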

GitHub Star History

# Market Updates

# General Features

# Comparisons: Hudi, Iceberg, Delta

A detailed comparison of these formats is available in Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake).

Typically, Parquet’s binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats such as Avro or ORC. The chart below shows which file formats each table format allows for a table's data files.

# Format Conversion

Tools like Delta Universal Format (UniForm) and XTable can facilitate conversion between table formats.

# Additional Resources

For more detail, see Use transactional processing and my Data Lake/Lakehouse Guide, where I cover this topic in more depth.

# Composable Data Stacks

Composable Data Stacks are closely related to table formats, since architectures such as the Lakehouse are built around open Table Formats.

Origin: Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)
Created 2022-06-10