Data Lake File Formats
At the core of a Data Lake is the Storage Layer, which predominantly houses files in open-source formats.
These data lake file formats are essentially the cloud’s successor to CSVs, but with enhanced capabilities: most are column-oriented, which enables efficient compression of large files, and they add features such as embedded schemas and statistics. The leading formats in this space are Apache Parquet, Apache Avro, and Apache ORC (Apache Arrow is closely related, but it is an in-memory format rather than a file format — see below). They constitute the physical storage, with files distributed across the buckets of your storage layer.
The utility of data lake file formats extends beyond mere storage. They play a crucial role in data sharing and exchange between different systems and processing frameworks. Key attributes of these formats include splittability (so engines can parallelize reads across a single file) and support for schema evolution, making them versatile across teams and programming languages.
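To make the column-oriented idea concrete, here is a toy pure-Python sketch (not a real file format, and the field names are made up) of how pivoting rows into columns enables column pruning and compression-friendly layouts:

```python
# Toy illustration: row-oriented storage keeps whole records together;
# column-oriented storage groups all values of one column, so a query
# that touches only "price" can skip the other columns entirely
# (column pruning), and runs of similar values compress well.

rows = [
    {"id": 1, "city": "Berlin", "price": 10.0},
    {"id": 2, "city": "Berlin", "price": 12.5},
    {"id": 3, "city": "Vienna", "price": 9.0},
]

# Pivot the row layout into a columnar layout.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Column pruning: the aggregation reads only the one column it needs.
total_price = sum(columns["price"])
print(total_price)  # 31.5

# Repeated adjacent values in a column are cheap to compress
# (the intuition behind run-length and dictionary encoding).
print(columns["city"])  # ['Berlin', 'Berlin', 'Vienna']
```

Real formats like Parquet and ORC build on this layout with per-column encodings, compression, and min/max statistics for predicate pushdown.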
# File Formats
- Apache Parquet: Columnar format optimized for analytical queries with compression, predicate pushdown, and column pruning (standard for data lakes).
- Apache Avro: Row-based format optimized for fast writes, schema evolution, and streaming pipelines.
- ORC: Columnar format optimized for Hive with built-in indexing, ACID transactions, and excellent compression ratios.
- CSV: Human-readable text format for universal portability and data interchange (convert to Parquet for production use).
- Meta’s Nimble: Experimental columnar format optimized for extremely wide schemas with thousands of columns (Meta’s Parquet successor).
- CWI’s FastLanes: Ultra-fast compression technique achieving >100 billion integers/second decompression via SIMD-friendly layouts.
- LanceDB’s Lance: Columnar format optimized for ML/AI workflows with 100x faster random access, vector search, and multimodal data support.
- SpiralDB’s Vortex: Extensible columnar format optimized for fast random access (100-200x faster) and scan performance using FastLanes compression.
- CMU + Tsinghua’s F3 File Format: Academic research format with WebAssembly-embedded decoders for future-proof extensibility and cross-platform interoperability.
- Feather File Format: A portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally.
- IoTDB TsFile: Optimized for Industrial Time-Series Datasets. It’s a high-performance columnar storage file format designed for industrial time-series data, featuring multi-language interfaces, high compression ratios, high read/write throughput, and fast random access capabilities.
# Not a File Format, But Still Related
Apache Arrow: In-memory columnar format optimized for zero-copy data access and vectorized processing across languages. Arrow’s on-disk counterpart is the Feather file format (listed above), which is included in the Arrow GitHub repository.
# Git Stars

Link: GitHub Star History

Check OSS Insight for more comparisons.
# When to Use Which File Format?
Thalia Barrera gives a great overview of file formats in her Complete Guide to Data Lakes and Lakehouses video tutorial:

# Comparison
Selecting the appropriate data lake file format hinges on your specific requirements. Apache Parquet is gaining traction for its efficiency. Avro stands out with its sophisticated schema description language and support for Schema Evolution. An interesting development to watch is how data lake table formats, one abstraction layer higher, relate to their underlying file formats.

Comparison of Data Lake File Formats (Inspired by Nexla: Introduction to Big Data Formats)
📝 Note: The importance of Schema Evolution is somewhat mitigated, as data lake table formats (discussed in the next chapter) also provide support for this feature.
Find more on Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi) | ssp.sh.
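Schema evolution, in essence, means resolving records written with an older schema against a newer reader schema: fields added with a default are filled in, and fields missing a default raise an error. A simplified pure-Python sketch of that resolution rule (illustrative only, not the real Avro implementation; the field names are made up):

```python
# Simplified sketch of Avro-style schema resolution: a record written
# with the old (v1) schema is read under the new (v2) reader schema.

old_record = {"id": 1, "city": "Berlin"}  # written with the v1 schema

reader_schema = [                          # v2 schema adds "price"
    {"name": "id"},
    {"name": "city"},
    {"name": "price", "default": 0.0},
]

def resolve(record, schema):
    """Fill in defaults for fields the writer did not know about."""
    out = {}
    for field in schema:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

print(resolve(old_record, reader_schema))
# {'id': 1, 'city': 'Berlin', 'price': 0.0}
```

Avro performs this resolution at read time using both the writer's and the reader's schema; table formats such as Delta Lake and Iceberg apply the same principle across whole tables of files.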
# Performance Comparison
By F3 File Format paper:

# Further Reads
Origin: Storage Layer
References: Big Data Pipeline Recipe - Full Article
Created 2022-08-15