🧠 Second Brain

Data Lake Table Formats (Open Table Formats)

Last updated Dec 12, 2024

Prominent table formats include Delta Lake, Apache Iceberg, and Apache Hudi.

Data lake table formats provide database-like features on top of distributed File Formats. Similar to a traditional table, these formats consolidate distributed files into a single table, simplifying management. Consider them an abstraction layer that structures your physical data files into coherent tables.
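To make the "abstraction layer" idea concrete, here is a minimal sketch of a toy table format. It is not how Delta Lake, Iceberg, or Hudi are implemented: the `_toy_log` directory, the `commit`, `live_files`, and `read_table` helpers, and the use of CSV instead of Parquet are all assumptions made for illustration. The point is only that a small metadata log decides which physical files belong to the table.

```python
import csv
import json
import tempfile
from pathlib import Path


def commit(table_dir: Path, add=(), remove=()) -> int:
    """Append one JSON commit to the toy log and return its version number."""
    log_dir = table_dir / "_toy_log"  # hypothetical metadata directory
    log_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(log_dir.glob("*.json")))
    entry = {"add": list(add), "remove": list(remove)}
    (log_dir / f"{version:08d}.json").write_text(json.dumps(entry))
    return version


def live_files(table_dir: Path) -> list[str]:
    """Replay the log in order to find the files that make up the table now."""
    files: list[str] = []
    for path in sorted((table_dir / "_toy_log").glob("*.json")):
        entry = json.loads(path.read_text())
        files = [f for f in files if f not in entry["remove"]]
        files.extend(entry["add"])
    return files


def read_table(table_dir: Path) -> list[dict]:
    """Read every live data file and return the rows as one logical table."""
    rows = []
    for name in live_files(table_dir):
        with open(table_dir / name, newline="") as fh:
            rows.extend(csv.DictReader(fh))
    return rows


def write_csv(table_dir: Path, name: str, rows: list[dict]) -> None:
    """Write one physical data file (CSV stands in for Parquet here)."""
    with open(table_dir / name, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["id", "city"])
        writer.writeheader()
        writer.writerows(rows)


table = Path(tempfile.mkdtemp())
write_csv(table, "part-0.csv", [{"id": 1, "city": "Bern"}])
write_csv(table, "part-1.csv", [{"id": 2, "city": "Oslo"}])
commit(table, add=["part-0.csv", "part-1.csv"])

# An "update": write a replacement file, then swap it in with a single commit.
write_csv(table, "part-2.csv", [{"id": 2, "city": "Lima"}])
commit(table, add=["part-2.csv"], remove=["part-1.csv"])

rows = read_table(table)
print(live_files(table))
print(rows)
```

Note that `part-1.csv` still exists on disk after the second commit. It is simply no longer part of the table, because the log, not the directory listing, defines the table's contents.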


GitHub Star History

# Tools

Table format tools:

# Why Open Table Formats

Table Formats are a key part of a Data Platform. If you run on a Closed-Source Data Platform but use open table formats, you still own the data and are less locked in. You can move the data, or, if it already sits in S3, leave it where it is and just switch the compute engine to another vendor or an open-source one.

That’s why Microsoft Fabric uses and pushes Delta Lake under the hood, and why Snowflake integrated Iceberg Tables, which are essentially Apache Iceberg tables.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontFamily': 'arial', 'fontSize': '16px'}}}%%
flowchart TD
    subgraph Open ["Open Data Architecture"]
        direction TB
        subgraph O [" "]
            direction TB
            O2[Open Compute]
            O4["Open Table Format"]
            O5[File Format]
            O6[Storage]

            O2 --> O4
            O4 --> O5
            O5 --> O6
        end
    end

    subgraph Closed ["Typical Closed Data Architecture"]
        direction TB
        subgraph C [" "]
            direction TB
            C2[Vendor Compute]
            C4[Proprietary Table Format]
            C5[Proprietary File Format]
            C6[Storage]

            C2 --> C4
            C4 --> C5
            C5 --> C6
        end
    end

    %% SWAP connection
    C4 ---|SWAP| O4

    %% Styling
    classDef subgraphStyle fill:#fff,stroke:#333,stroke-width:2px;
    classDef nodeStyle fill:#f9f9f9,stroke:#666,stroke-width:1px,rx:5px;
    classDef locked fill:#ffecec,stroke:#ff9999;
    classDef unlocked fill:#e6ffe6,stroke:#99cc99;
    classDef neutral fill:#ffffff,stroke:#666;

    class Open,Closed subgraphStyle;
    class O2,C2 nodeStyle;
    class O4 unlocked;
    class C4,C5 locked;
    class O5,O6,C6 neutral;
```


Inspired by Open Table Formats and the Open Data Lakehouse, In Perspective.
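The engine swap described in this section can be sketched in a few lines. This is a toy under stated assumptions: the `manifest.json` layout and the `VendorEngine` and `OpenSourceEngine` classes are hypothetical stand-ins, not real products or real table format metadata. What it shows is that two unrelated engines can compute over the same stored files as long as both understand the same open metadata.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical "open" table layout: a manifest listing JSON-lines data files.
root = Path(tempfile.mkdtemp())
(root / "a.jsonl").write_text('{"amount": 10}\n{"amount": 5}\n')
(root / "b.jsonl").write_text('{"amount": 7}\n')
(root / "manifest.json").write_text(json.dumps({"files": ["a.jsonl", "b.jsonl"]}))


class VendorEngine:
    """Stand-in for a vendor's compute engine: loads all rows, then sums."""

    def total(self, table: Path) -> int:
        manifest = json.loads((table / "manifest.json").read_text())
        rows = []
        for name in manifest["files"]:
            for line in (table / name).read_text().splitlines():
                rows.append(json.loads(line))
        return sum(row["amount"] for row in rows)


class OpenSourceEngine:
    """Stand-in for a different engine: streams files and keeps a running sum."""

    def total(self, table: Path) -> int:
        manifest = json.loads((table / "manifest.json").read_text())
        running = 0
        for name in manifest["files"]:
            with open(table / name) as fh:
                for line in fh:
                    running += json.loads(line)["amount"]
        return running


before = VendorEngine().total(root)
after = OpenSourceEngine().total(root)  # same files, nothing copied or exported
print(before, after)
```

Both engines return the same total without any export or migration step. With a proprietary table or file format, the second engine would have no documented way to interpret the files, which is the lock-in the diagram marks in red.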

# Market Updates

# General Features

# Comparisons: Hudi, Iceberg, Delta

A detailed comparison of these formats is available in Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake).

Typically, Parquet’s binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats like Avro or ORC. Below is a chart that shows which file formats each table format allows for the data files of a table.



# Format Conversion

Tools like Delta Universal Format (UniForm) and XTable can ease conversion between table formats.
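As far as I understand these tools, they work by translating table metadata rather than rewriting the underlying data files. The sketch below illustrates only that general idea with a toy model: the action-log layout, the snapshot layout, and the `to_snapshot` function are assumptions for illustration and do not reflect the real metadata of Delta Lake, Iceberg, or Hudi, nor the actual APIs of UniForm or XTable.

```python
import json
import tempfile
from pathlib import Path

table = Path(tempfile.mkdtemp())
(table / "part-0.parquet").write_bytes(b"pretend parquet bytes")

# Toy source metadata: an ordered log of add/remove actions (hypothetical layout).
source_log = [
    {"add": "part-0.parquet"},
    {"add": "part-1.parquet"},
    {"remove": "part-1.parquet"},
]


def to_snapshot(log: list[dict]) -> dict:
    """Translate the action log into a snapshot-style listing of live files."""
    live: list[str] = []
    for action in log:
        if "add" in action:
            live.append(action["add"])
        else:
            live.remove(action["remove"])
    return {"snapshot_id": len(log), "data_files": live}


# Only a new metadata file is written; the data file is never opened for writing.
target = to_snapshot(source_log)
(table / "toy_snapshot.json").write_text(json.dumps(target))
print(target)
```

In this toy, the conversion cost depends on the size of the metadata, not on the size of the data, which is why a metadata-level translation is attractive for large tables.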

# History


The History and Evolution of Open Table Formats - Part I

# Additional Resources

For more detail, see Use transactional processing and my Data Lake/Lakehouse Guide, where I cover this topic in more depth.

# Composable Data Stacks

Composable Data Stacks relate to table formats, as stacks like Lakehouse are built around open Table Formats.


Origin: Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)
References:
Created 2022-06-10