Data Lake Table Formats

Last updated Apr 16, 2024

Prominent table formats include Delta Lake, Apache Iceberg, and Apache Hudi.

Data lake table formats are pivotal because they act as the database layer on top of a data lake. Like a traditional table, they consolidate distributed files into a single table, which simplifies management. Consider them an abstraction layer that structures your physical data files into coherent tables.
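
To make the abstraction-layer idea concrete, here is a toy sketch in plain Python. It is not how Delta Lake, Iceberg, or Hudi are implemented. The `_manifest.json` file, the CSV data files, and all function names are assumptions made up for illustration. The only point is that readers follow table metadata to find the data files instead of listing the directory.

```python
from __future__ import annotations

import csv
import json
import tempfile
from pathlib import Path


def write_data_file(table_dir: Path, name: str, rows: list[dict]) -> str:
    """Write one physical data file (CSV here, for simplicity) and return its name."""
    with (table_dir / name).open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "city"])
        writer.writeheader()
        writer.writerows(rows)
    return name


def commit(table_dir: Path, data_files: list[str]) -> None:
    """Record which files make up the table in a hypothetical manifest file."""
    manifest = {"version": 1, "data_files": data_files}
    (table_dir / "_manifest.json").write_text(json.dumps(manifest))


def read_table(table_dir: Path) -> list[dict]:
    """Read the table by following the manifest, not by listing the directory."""
    manifest = json.loads((table_dir / "_manifest.json").read_text())
    rows: list[dict] = []
    for name in manifest["data_files"]:
        with (table_dir / name).open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


def demo() -> list[dict]:
    with tempfile.TemporaryDirectory() as tmp:
        table_dir = Path(tmp)
        f1 = write_data_file(table_dir, "part-0.csv", [{"id": 1, "city": "Bern"}])
        f2 = write_data_file(table_dir, "part-1.csv", [{"id": 2, "city": "Oslo"}])
        # This file exists on disk but is never committed, so readers ignore it.
        write_data_file(table_dir, "part-2.csv", [{"id": 3, "city": "Rome"}])
        commit(table_dir, [f1, f2])
        return read_table(table_dir)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

In this sketch, the two committed files read back as one logical table, while the uncommitted third file stays invisible. Real table formats build on the same basic idea of tracking data files in metadata and add schema tracking, versioned snapshots, and transactional commits on top.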


GitHub Star History

# Market Updates

# General Features

# Comparisons: Hudi, Iceberg, Delta

A detailed comparison of these formats is available in Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake).

Typically, Parquet's binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats such as Avro or ORC. Below is a chart that shows which file formats each table format allows for the data files of a table.



# Format Conversion

Tools like Delta Universal Format (UniForm) and XTable can facilitate conversion between table formats.

# Additional Resources

Further insights are available in Use transactional processing and in my Data Lake/Lakehouse Guide, where I cover this topic in more detail.


Created 2022-06-10