Apache Hudi
Hudi is a rich platform to build streaming data lake’s with incremental data pipelines
on a self-managing database layer, while being optimized for lake engines and regular batch processing.
# History
The project was originally developed at Uber in 2016 (code-named and pronounced “Hoodie”), open-sourced in 2017 ( first commit in 2016-12-16), and submitted to the Apache Incubator in January 2019. Source
RW Onehouse Brings a Fully-Managed Lakehouse to Apache Hudi VentureBeat:
which was developed internally at Uber back in 2016 to bring data warehouse-like functionality to Data Lakes. In the intervening years, Hudi has been adopted by major companies such as Amazon, Disney, and Bytedance.
# Hudi Features
- Upserts, Deletes with fast, pluggable indexing.
- Incremental queries, Record level change streams
- Transactions, Rollbacks, Concurrency Control -
- SQL Read/Writes from Apache Spark, Presto, Trino, Apache Hive & more
- Automatic file sizing, data clustering, Compaction, cleaning.
- Streaming ingestion, Built-in CDC sources & tools.
- Built-in metadata tracking for scalable storage access.
- Backwards compatible Schema Evolution and enforcement.
Pushed by Onehouse (Kyle Weller)
# CRDT Future feature
Hudi supports partial row updates, which means 2 different writes/updates could be updating 2 different columns in the same row. Since input data to Hudi are generally coming from a different source such as Kafka which may receive out of order writes, Hudi by itself cannot assign ordered timestamps to these updates. Hence, LWW based on timestamp does not directly work here.
If along with a timestamp based resolution, we also had a deterministic merge function that allows for conflict resolution of these 2 updates, we could possibly apply this function at a later stage and provide correct results. An implementation of such a merge function is a CRDT.
In a CRDT data structure, updates are
- Commutative → Order of updates does not matter
- Idempotent → Applying same updates repeatedly does not matter
- Associative → Order of groups of updates does not matter
This merge function can be implemented by users to ensure a correct ordering of updates based on internal semantics of the data which may be timestamps or other such logic.
To summarize, using a CRDT, we can allow multiple writers to commit concurrently without the need for a lock to perform conflict resolution while still providing ACID semantics. When readers construct the latest state of the table, a CRDT ensures the 2 different snapshots of the data get applied and a deterministic, serial order can be maintained on update conflicts.
Since CRDTs have to be commutative, the order of “merge” does not matter to generate correct result after applying the rules. RFC - 22 : Snapshot Isolation using Optimistic Concurrency Control for multi-writers - HUDI - Apache Software Foundation
Origin: Data Lake Table Format