Search
Nimble
New file format for storage of large columnar datasets.
Nimble (formerly known as “Alpha”) is a new columnar file format for large datasets created by Meta. Nimble is meant to be a replacement for file formats such as Apache Parquet and ORC.
Watch this talk to learn more about Nimble’s internals.
# Design Principles
Nimble has the following design principles:
- Wide: Nimble is better suited for workloads that are wide in nature, such as tables with thousands of columns (or streams) which are commonly found in feature engineering workloads and training tables for machine learning.
- Extensible: Since the state-of-the-art in data encoding evolves faster than the file layout itself, Nimble decouples stream encoding from the underlying physical layout. Nimble allows encodings to be extended by library users and recursively applied (cascading).
- Parallel: Nimble is meant to fully leverage highly parallel hardware by providing encodings which are SIMD and GPU friendly. Although this is not implemented yet, we intend to expose metadata to allow developers to better plan decoding trees and schedule kernels without requiring the data streams themselves.
- Unified: More than a specification, Nimble is a product. We strongly discourage developers to (re-)implement Nimble’s spec to prevent environmental fragmentation issues observed with similar projects in the past. We encourage developers to leverage the single unified Nimble library, and create high-quality bindings to other languages as needed.
# Features
Nimble has the following features:
- Lighter metadata organization to efficiently support thousands to tens of thousands of columns and streams.
- Use Flatbuffers instead of thrift/protobuf to more efficiently access large metadata sections.
- Use block encoding instead of stream encoding to provide predictable memory usage while decoding/reading.
- Supports many encodings out-of-the-box, and additional encodings can be added as needed.
- Supports cascading (recursive/composite) encoding of streams.
- Supports pluggable encoding selection policies.
- Provide extensibility APIs where encodings and other aspects of the file can be extended.
- Clear separation between logical and physical encoded types.
- And more.
More on GitHub
Origin:
Nimble and Lance: The Parquet Killers - by Chris Riccomini
References:
Created 2024-05-21