
Use transactional processing

Last updated Feb 9, 2024

To support the features of a relational database in a data lake, you need transactional processing, and Delta Lake is here to the rescue. Delta has many useful features: ACID transactions, time travel, a transaction log (the Delta Log), a SQL API for writing native insert, update, delete, and even merge statements, an open format (Apache Parquet), a unified batch and streaming source and sink (no Lambda Architecture needed anymore), schema evolution, and optimistic concurrency. The full list and more information are on Delta Lake. Also, check out an excellent customer example from Apple about threat detection at scale.


Key features of Delta Lake as of June 2020 (delta.io)

Since data lakes commonly consist of many distributed files, it's hard to keep them structured and organized, especially if you want to insert, update, or delete rows. Delta offers several APIs: besides Scala and Python, it also gives you a SQL API (from Spark 3.0 on, or in Databricks), where you can easily write update or even merge statements against your distributed files, as sketched below. And because it's powered by Spark, you can do this entirely at scale.
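
A rough sketch of what that SQL API looks like from PySpark; the table names (`customers`, `updates`) and join key are made up, and the configuration assumes the delta-spark package is available:

```python
# Sketch only: a Delta-enabled SparkSession plus a SQL MERGE (upsert).
# Table names (customers, updates) and the join key are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-sql-api")
    # Standard Delta Lake settings; requires the delta-spark package.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# One atomic transaction: update matching rows, insert the rest.
spark.sql("""
    MERGE INTO customers AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```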

Behind the scenes sits raw Apache Parquet: optimized, highly compressed columnar storage. This lets you query your data directly out of your data lake efficiently; no initial transformation is needed.
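
A minimal sketch of querying a Delta table in place, reusing the `spark` session from the sketch above; the path and columns are made up:

```python
# The files at this path are plain Parquet plus a _delta_log directory,
# so they can be queried directly, without an upfront ETL step.
events = spark.read.format("delta").load("/mnt/datalake/events")  # hypothetical path
events.where("event_type = 'click'").groupBy("country").count().show()
```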

With slowly changing dimensions (SCD) still being a thing in some cases, Delta's time travel can solve this. It's similar to the snapshotting we did back in earlier days, but this time on fairly cheap blob storage instead of SSDs: rather than daily or monthly snapshots, Delta stores every batch of changes as a separate version. This gives you the power to travel back to older versions in case you mistakenly deleted something, or if analytics needs to be pointed at a specific version, for as long as the retention period allows.
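
A hedged sketch of time travel, again reusing the `spark` session from the first sketch; the path, version number, and timestamp are purely illustrative:

```python
path = "/mnt/datalake/customers"  # hypothetical table location

latest = spark.read.format("delta").load(path)

# Read the table as it looked at an older version ...
v5 = spark.read.format("delta").option("versionAsOf", 5).load(path)

# ... or as it looked at a point in time, within the configured retention period.
feb_8 = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-02-08")
    .load(path)
)
```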

These changes and snapshots are stored in a transaction log maintained by Delta. Besides enabling time travel, this is a kind of Change Data Capture (CDC), with all changes tracked per table: you can see which files were affected, which operation ran, who ran it, and more, for every transaction. Each transaction is stored separately in JSON format. Since the transaction log can get big as well, Delta creates a checkpoint file every ten commits containing the entire state of the table at that point in time, in Parquet format, which is fast and simple for Spark to read.
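
To inspect that per-table history from the transaction log, you can ask Delta for it directly; a sketch assuming the delta-spark Python package plus the session and path from the earlier sketches:

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/mnt/datalake/customers")  # hypothetical path

# Each row is one commit: version, timestamp, operation (WRITE, MERGE, DELETE, ...),
# who ran it, and operation metrics -- effectively CDC-style metadata per transaction.
table.history().select("version", "timestamp", "operation", "userName").show(truncate=False)
```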


References: Apache Parquet, Delta Lake, Databricks