DuckLake is a DuckDB Labs approach to simplifying Lakehouse architectures: all metadata lives in a standard SQL database (e.g., DuckDB or PostgreSQL) instead of complex file-based systems, while the data itself is still stored in open file formats like Parquet.
The goal: make it more reliable, faster, and easier to manage.
# Architecture
It’s not a DuckDB thing per se, but an open standard that tries to remove the overhead of Open Table Format features such as ACID transactions and multi-writes by writing the metadata not to files, but to an actual database.
See the image below: what used to be separate metadata files plus an Open Catalog is now all written by DuckLake to a relational database, which natively provides the features Open Table Formats had to rebuild on top of files:

DuckLake’s architecture: Just a database and some Parquet files | Image from The DuckLake Manifesto: SQL as a Lakehouse Format – DuckLake
Important to know
There is the open standard “DuckLake”, which combines the metadata with the catalog, and there is the implementation, the “ducklake” DuckDB extension (MIT-licensed).
# Replacement of OTF?
Does it replace Open Table Formats? No. It is an alternative to Apache Iceberg + Polaris Catalog or Delta Lake + Unity Catalog. This means the catalog and database-like features are in one unit, which is the part that “simplifies”.
# Separation between Compute and Storage
With DuckLake we have the same advantages of Compute and Storage Separation as with OTF:
- Hannes notes that Snowflake and BigQuery were “the pioneers” of separating storage and compute
- They use the same architecture as DuckLake but with proprietary, closed systems
- DuckLake uses “the exact same design that snowflake and bigquery have been using for a long time already successfully with the one difference that on the bottom we have the open format parquet”
DuckLake’s innovation is the three-way separation: storage, compute, AND metadata can all scale independently.
# How Does the Compute Model Work?
According to Introducing DuckLake - YouTube:
- “Everyone runs a DuckDB instance” - Instead of a traditional client-server model, each user/application runs their own compute node
- Fully distributed compute: “You can have an arbitrary amount of DuckDB instances” that all connect to the same metadata server and storage
- Local compute nodes: The compute runs “on your laptop, your co-workers can have it running on their laptop, you can have services like EC2 instances, lambda functions, wherever you can run DuckDB”
- No resource scheduling problems: Eliminates traditional Data Warehouse issues of “fairness and resource scheduling”
The fundamental shift is from “client-server” to “compute-everywhere”: instead of scaling up a central server and connecting clients to it, you scale out compute nodes that coordinate through lightweight shared metadata.
This eliminates the traditional data warehouse bottlenecks while maintaining full transactional consistency across the distributed system. The clients are essentially the compute nodes.
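To make this concrete, here is a minimal sketch of the “compute-everywhere” idea with a shared PostgreSQL catalog and an S3 bucket. The `ducklake:postgres:` connection string follows the pattern from the extension docs; host, database, bucket, and table names are made up:

```sql
-- Session A: DuckDB on your laptop
INSTALL ducklake; LOAD ducklake;
ATTACH 'ducklake:postgres:dbname=lake_catalog host=catalog.internal' AS lake
    (DATA_PATH 's3://my-bucket/lake/');
CREATE TABLE IF NOT EXISTS lake.events (id INTEGER, action VARCHAR);
INSERT INTO lake.events VALUES (1, 'click');

-- Session B: another DuckDB instance (EC2, Lambda, a co-worker's laptop)
-- attaches to the same catalog and storage and sees session A's committed rows
ATTACH 'ducklake:postgres:dbname=lake_catalog host=catalog.internal' AS lake
    (DATA_PATH 's3://my-bucket/lake/');
FROM lake.events;
```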
# Features
Some of the features are covered below; there are many more.
# Zero-Trust Data Lakes (Encryption)
Automatic encryption with key separation: you simply specify `encrypted: true`; there is no need to manage encryption keys yourself. DuckLake automatically generates a unique encryption key for every Parquet file. These keys are stored in the metadata database, which sits in a different trust zone than your file storage.
DuckLake only writes Parquet files to blob storage (Parquet has built-in encryption support), and all metadata stays in the separate database, so there are no unencrypted metadata files on blob storage.
This is a huge innovation and unblocks many use cases where you aren’t allowed to store sensitive data with a public cloud provider.
Real-World Architecture Possibilities
You could have the following setup:
- Private metadata server: Run a local PostgreSQL server in your organization that you trust
- Public blob storage: Store encrypted files on untrusted S3 buckets
- CDN distribution: You could even expose encrypted files through Cloudflare caching “because the only thing that you need to keep private is the access to your metadata storage”
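A minimal sketch of that zero-trust setup, assuming the extension's `ENCRYPTED` ATTACH option and a `ducklake:postgres:` connection string; the host, database, bucket, and table names below are placeholders, and the exact option spelling may differ per DuckLake version:

```sql
-- Metadata in a private Postgres you trust; encrypted Parquet on public/untrusted S3.
-- (Sketch: option spelling and DSN details may differ, check the DuckLake docs.)
INSTALL ducklake; LOAD ducklake;
ATTACH 'ducklake:postgres:dbname=lake_catalog host=10.0.0.5 user=lake' AS secure_lake
    (DATA_PATH 's3://public-bucket/lake/', ENCRYPTED);

-- Per-file keys are generated automatically and stored only in the Postgres catalog;
-- anyone listing the bucket sees only encrypted Parquet files.
CREATE TABLE secure_lake.patients (id INTEGER, name VARCHAR);
INSERT INTO secure_lake.patients VALUES (1, 'Alice');
```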
Hive or Iceberg catalogs couldn’t do that
Hive cannot encrypt because it relies on partition information being visible in directory names, and Iceberg/Delta cannot encrypt because they write unencrypted Avro and JSON metadata files to the blob store alongside the data.
# Articles
Here are the two announcement articles by DuckDB, on their new site DuckLake.select and on Duckdb.org.
# Podcast: Introducing DuckLake
Announcement podcast with Hannes Mühleisen and Mark Raasveldt. Key takeaways:
- Introducing DuckLake, a new project from the creators of DuckDB to address challenges in multiplayer database setups and data lake architectures.
- DuckDB has gained popularity for single-node analytics, but faces limitations in multi-user environments due to database file locking.
- Existing data lake formats like Apache Iceberg and Delta Lake attempt to solve update problems in data lakes, but involve complexity and questionable aesthetics.
- DuckLake proposes a new approach by leveraging a database system (e.g., PostgreSQL, DuckDB) to manage metadata for data lakes, enabling transactional consistency and multi-table operations.
- Principles of DuckLake: simplicity, scalability, and speed.
- Key features include arbitrary query support, multi-schema, multi-table transactions, time travel, schema evolution, views, partitioning, data inlining, and encryption support.
- You can keep your encryption keys in the DuckLake catalog in a secure place, even locally, and leave the data encrypted on the data lake. So even if it sits on public S3, nobody can access it unless they also have access to the DuckLake catalog.
- This is very valuable for data-sharing capabilities with Open Table Formats.
- DuckLake offers compatibility with existing Open Table Formats
- The DuckLake DuckDB extension implementation supports various metadata storage options (e.g., DuckDB, SQLite, MySQL, PostgreSQL, MotherDuck) and blob storage backends (e.g., S3, Google Cloud Storage, Azure Blob Storage).
- DuckLake is positioned as a solution for multiplayer DuckDB, enabling globally synchronized transactions across distributed instances.
- Next steps include releasing the extension, adding import/export capabilities for Iceberg, and expanding database support.
- Mentioned and shout outs:
# The DuckLake Manifesto & Technical Implementation
Sources: The DuckLake Manifesto: SQL as a Lakehouse Format and DuckLake: SQL as a Lakehouse Format.
TL;DR: DuckLake simplifies Lakehouse architectures by leveraging SQL databases for metadata management, enabling efficient data changes and scalability.
DuckLake re-imagines the Lakehouse architecture by storing data in open formats like Parquet while managing metadata within a SQL database, supporting ACID transactions and referential consistency. Unlike Iceberg and Delta Lake, DuckLake avoids complex file-based metadata systems, streamlining processes with standard SQL. The architecture uses a SQL database for the catalog and table metadata, facilitating schema creation, modification, and data operations through SQL transactions.
The DuckLake format, implemented as a DuckDB extension, supports multi-table transactions, complex types, schema evolution, time travel, transactional DDL, and incremental scans. It integrates with various storage systems (S3, Azure Blob Store, GCS) and SQL databases (PostgreSQL, SQLite, MySQL, MotherDuck). DuckLake allows transparent inlining of small changes directly into the metadata store, minimizing small-file issues. For example, initializing DuckLake involves `ATTACH 'ducklake:metadata.ducklake' AS my_ducklake;`, followed by standard SQL operations like `CREATE TABLE my_ducklake.demo (i INTEGER);` and `INSERT INTO my_ducklake.demo VALUES (42), (43);`.
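Put together, the quick-start from the manifesto looks like this (the first three statements are taken from the source; the final read-back is added for illustration):

```sql
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake;
CREATE TABLE my_ducklake.demo (i INTEGER);
INSERT INTO my_ducklake.demo VALUES (42), (43);
FROM my_ducklake.demo; -- read back the two inserted rows
```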
As Mark Raasveldt and Hannes Mühleisen said, this design allows for low-latency query planning, reduces small file writes, supports many concurrent changes, and enables sub-millisecond writes with support for millions of snapshots, significantly outperforming traditional lakehouse formats in scenarios with frequent, small data changes.
Practical takeaway: By consolidating metadata management into a SQL database, DuckLake offers a more streamlined and scalable solution for lakehouse architectures, particularly beneficial for applications requiring high concurrency and low-latency data modifications.
Data Model of DuckLake
For how the catalog is structured and set up, check out the physical data model DuckLake creates internally in the Data Model of DuckLake section below.
# Hackernews
Check the comments here.
# MotherDuck’s View
Jordan Tigani writes about the limitations of storing metadata in S3 for data lakehouse formats like Iceberg and Delta Lake, highlighting the performance bottlenecks and complexities in implementing ACID operations. DuckLake addresses these issues by storing table and physical metadata in a database of your choosing (e.g., Postgres), while keeping data in an S3-compatible object store as Parquet files.
This design allows for faster metadata lookups using SQL queries (a couple of milliseconds vs. hundreds of milliseconds for cold S3 scans). DuckLake also reduces metadata file compaction, optimizing snapshot management by pointing multiple snapshots to portions of a single Parquet file. According to Jordan, this approach makes DuckLake more portable and easier to implement than Iceberg. MotherDuck is rolling out hosted support for DuckLake, offering fast, cloud-proximate queries and scalable data transformation.
Jordan argues to consider DuckLake for data lake implementations where metadata performance and transactional capabilities are critical, and leverage MotherDuck’s hosted support for enhanced query speed and data management.
Join MotherDuck CEO Jordan Tigani and DuckDB’s Hannes Mühleisen for an in-depth discussion about DuckLake.
My takeaways:
- Avro is a poor fit because it can only be read row-based, so large metadata is not read efficiently. Also, with Iceberg, if you already have Parquet, adding another format just for metadata complicates everything.
- Access and permission patterns can be handled through encryption keys; a potential new feature is per-column encryption keys, which Parquet supports. That means you could have different keys per column and grant access that way. This is not very advanced compared to a database, where you can grant row- and column-level access, but given it is based on distributed files, that would already be great. MotherDuck is probably better positioned to integrate something easier here.
- Open Table Formats make sense cost-wise for data you do not need every day. Instead of storing it in costly data warehouses with dedicated hardware, you can store it on “cheap” distributed S3 storage.
- Iceberg writers are sparse. Not because nobody wants to implement one, but because you basically have to build and integrate database capabilities, as the file format is just a “format” and not the actual engine that runs it. It’s also very hard with Iceberg, as you now have Avro (metadata), Parquet (data, snapshots), and new Open Catalogs for access and listing capabilities. All of it has to be coordinated, ideally concurrently, when multiple writers want to write at the same time.
- This is why DuckLake was created: databases are designed for exactly this kind of atomic, integrity-preserving workload. That’s why DuckLake can support thousands of concurrent updates (e.g., through Postgres), whereas Iceberg is around one per second according to some.
- Updates can be performed transactionally, and small changes can be inlined into the catalog database tables for high-frequency updates.
- Criticism, besides the huge amount of positive feedback: it creates a new standard just when people started to align on Iceberg/Delta/Hudi. But in fact, it’s less a brand-new format than an addition that simplifies things and avoids locking users into a vendor through closed-source catalog implementations.
AI Highlights from YouTube:
- Open data formats like Iceberg and Delta Lake aim to break data silos and facilitate data exchange between systems.
- Hadoop era pain points included unruly data lakes and challenges with updating data.
- Concerns that open table formats are conceptually flawed and complex to implement, particularly writers.
- Duck Lake: Acknowledges the need for a catalog server (database) for metadata management, unlike Iceberg’s initial file-based structure.
- Duck Lake uses a database schema for metadata, including tables for tables, columns, and files, along with a specification and DuckDB extension.
- Duck Lake offers a simpler, more elegant solution with fewer technologies involved, making implementation easier.
- Duck Lake can handle a large number of snapshots without performance issues.
- Duck Lake is agnostic to the metadata database used, allowing flexibility in choosing the storage solution.
- Initial feedback on Duck Lake has been positive, with vendors expressing interest.
- Discussion of access management and encryption strategies for Duck Lake.
- Thoughts on the Iceberg REST API and the tension between database people and developers.
# DuckLake Example (Code) Including Key Features
This example demonstrates DuckLake’s key features:
- Catalog Database: DuckDB (for metadata storage)
- Storage Backend: AWS S3 (for Parquet files)
- Schema Evolution: Adding/dropping columns, type promotion
- Time Travel: Querying historical snapshots
# Prerequisites
```sql
INSTALL ducklake;
INSTALL aws; -- for S3 support
```
# Setup DuckLake with S3 Storage
```sql
-- Configure AWS credentials (if needed)
CREATE SECRET ducklake (
    TYPE s3,
    KEY_ID 'my_secret_key1',            --replace with yours
    SECRET 'my_secret_value1',          --replace with yours
    REGION 'eu-central-1',              --replace with yours
    SCOPE 's3://sspaeti/warehouse-data' --replace with yours
);

--list
FROM duckdb_secrets();

-- Create DuckLake with DuckDB catalog and S3 storage
ATTACH 'ducklake:my_warehouse.ducklake' AS warehouse
    (DATA_PATH 's3://sspaeti/warehouse-data/');
USE warehouse;
```
# Create Initial Table and Insert Data
```sql
-- Create a customer table
CREATE TABLE customers (
    id INTEGER,
    name VARCHAR,
    email VARCHAR,
    created_date DATE
);

-- Insert initial data
INSERT INTO customers VALUES
    (1, 'Alice Johnson', 'alice@email.com', '2024-01-15'),
    (2, 'Bob Smith', 'bob@email.com', '2024-02-20'),
    (3, 'Carol Davis', 'carol@email.com', '2024-03-10');

-- Check current data
FROM customers;
```
# Schema Evolution - Adding Columns
Schema evolution is a great feature:
```sql
-- Add a new column with default value
ALTER TABLE customers ADD COLUMN phone VARCHAR DEFAULT 'Not provided';

-- Add another column for customer status
ALTER TABLE customers ADD COLUMN status VARCHAR DEFAULT 'Active';

-- Verify schema changes
FROM customers;
-- Output shows new columns with default values for existing rows

-- Insert new data with all columns
INSERT INTO customers VALUES
    (4, 'David Wilson', 'david@email.com', '2024-04-05', '555-0123', 'Premium');

FROM customers;
┌───────┬───────────────┬─────────────────┬──────────────┬──────────────┬─────────┐
│  id   │     name      │      email      │ created_date │    phone     │ status  │
│ int32 │    varchar    │     varchar     │     date     │   varchar    │ varchar │
├───────┼───────────────┼─────────────────┼──────────────┼──────────────┼─────────┤
│     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │ Not provided │ Active  │
│     2 │ Bob Smith     │ bob@email.com   │ 2024-02-20   │ Not provided │ Active  │
│     3 │ Carol Davis   │ carol@email.com │ 2024-03-10   │ Not provided │ Active  │
│     4 │ David Wilson  │ david@email.com │ 2024-04-05   │ 555-0123     │ Premium │
└───────┴───────────────┴─────────────────┴──────────────┴──────────────┴─────────┘
```
# More Data Changes for Time Travel Demo
```sql
-- Update customer status
UPDATE customers SET status = 'VIP' WHERE email = 'alice@email.com';

-- Add more customers
INSERT INTO customers VALUES
    (5, 'Eva Martinez', 'eva@email.com', '2024-05-01', '555-0456', 'Active');

-- Delete a customer (creates another snapshot)
DELETE FROM customers WHERE id = 2;
```
# Schema Evolution - Dropping Columns
```sql
-- Drop the phone column (maybe privacy concerns)
ALTER TABLE customers DROP COLUMN phone;

-- Verify the column is gone
FROM customers;
```
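Type promotion (widening a column's type) is the third schema-evolution capability mentioned above. A hedged sketch, using DuckDB's standard `ALTER TABLE` syntax and assuming DuckLake accepts a plain widening such as INTEGER to BIGINT as a metadata-only change; it is illustrative and not part of the recorded snapshot history below:

```sql
-- Widen id from INTEGER to BIGINT; existing Parquet files should not need rewriting
ALTER TABLE customers ALTER COLUMN id SET DATA TYPE BIGINT;

-- Verify the new column type
DESCRIBE customers;
```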
# Time Travel Queries
Time Travel features:
```sql
-- List all snapshots to see our history
D FROM warehouse.snapshots();
┌─────────────┬────────────────────────────┬────────────────┬─────────────────────────────────────────────────────┐
│ snapshot_id │       snapshot_time        │ schema_version │                       changes                       │
│    int64    │  timestamp with time zone  │     int64      │               map(varchar, varchar[])               │
├─────────────┼────────────────────────────┼────────────────┼─────────────────────────────────────────────────────┤
│           0 │ 2025-05-28 15:19:40.916+02 │              0 │ {schemas_created=[main]}                            │
│           1 │ 2025-05-28 15:20:15.92+02  │              1 │ {tables_created=[main.customers]}                   │
│           2 │ 2025-05-28 15:26:00.582+02 │              2 │ {tables_dropped=[1]}                                │
│           3 │ 2025-05-28 15:26:02.818+02 │              3 │ {tables_created=[main.customers]}                   │
│           4 │ 2025-05-28 15:31:15.704+02 │              3 │ {tables_inserted_into=[2]}                          │
│           5 │ 2025-05-28 15:31:35.534+02 │              4 │ {tables_dropped=[2]}                                │
│           6 │ 2025-05-28 15:31:41.484+02 │              5 │ {tables_created=[main.customers]}                   │
│           7 │ 2025-05-28 15:31:43.698+02 │              5 │ {tables_inserted_into=[3]}                          │
│           8 │ 2025-05-28 15:32:04.017+02 │              6 │ {tables_altered=[3]}                                │
│           9 │ 2025-05-28 15:32:08.438+02 │              7 │ {tables_altered=[3]}                                │
│          10 │ 2025-05-28 15:32:23.051+02 │              7 │ {tables_inserted_into=[3]}                          │
│          11 │ 2025-05-28 15:33:49.549+02 │              7 │ {tables_deleted_from=[3]}                           │
│          12 │ 2025-05-28 15:36:59.068+02 │              7 │ {tables_inserted_into=[3], tables_deleted_from=[3]} │
│          13 │ 2025-05-28 15:37:16.985+02 │              7 │ {tables_inserted_into=[3]}                          │
│          14 │ 2025-05-28 15:37:42.358+02 │              8 │ {tables_altered=[3]}                                │
├─────────────┴────────────────────────────┴────────────────┴─────────────────────────────────────────────────────┤
│ 15 rows                                                                                                4 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```
Query version 10, which still includes the phone number column:
```sql
D SELECT * FROM customers AT (VERSION => 10);
┌───────┬───────────────┬─────────────────┬──────────────┬──────────────┬─────────┐
│  id   │     name      │      email      │ created_date │    phone     │ status  │
│ int32 │    varchar    │     varchar     │     date     │   varchar    │ varchar │
├───────┼───────────────┼─────────────────┼──────────────┼──────────────┼─────────┤
│     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │ Not provided │ Active  │
│     2 │ Bob Smith     │ bob@email.com   │ 2024-02-20   │ Not provided │ Active  │
│     3 │ Carol Davis   │ carol@email.com │ 2024-03-10   │ Not provided │ Active  │
│     4 │ David Wilson  │ david@email.com │ 2024-04-05   │ 555-0123     │ Premium │
└───────┴───────────────┴─────────────────┴──────────────┴──────────────┴─────────┘
```
even though the current version no longer has it:
```sql
D from customers;
┌───────┬───────────────┬─────────────────┬──────────────┬─────────┐
│  id   │     name      │      email      │ created_date │ status  │
│ int32 │    varchar    │     varchar     │     date     │ varchar │
├───────┼───────────────┼─────────────────┼──────────────┼─────────┤
│     3 │ Carol Davis   │ carol@email.com │ 2024-03-10   │ Active  │
│     4 │ David Wilson  │ david@email.com │ 2024-04-05   │ Premium │
│     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │ VIP     │
│     5 │ Eva Martinez  │ eva@email.com   │ 2024-05-01   │ Active  │
└───────┴───────────────┴─────────────────┴──────────────┴─────────┘
```
This is what the S3 object store looks like:

```sql
-- Query by timestamp (if you know approximate times)
D SELECT * FROM customers AT (TIMESTAMP => '2025-05-28 15:31:43.698+02');
┌───────┬───────────────┬─────────────────┬──────────────┐
│  id   │     name      │      email      │ created_date │
│ int32 │    varchar    │     varchar     │     date     │
├───────┼───────────────┼─────────────────┼──────────────┤
│     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │
│     2 │ Bob Smith     │ bob@email.com   │ 2024-02-20   │
│     3 │ Carol Davis   │ carol@email.com │ 2024-03-10   │
└───────┴───────────────┴─────────────────┴──────────────┘
```
# Advanced Features
# Change Data Feed
```sql
-- See what changed between snapshots 0 and 14
D SELECT * FROM warehouse.table_changes('customers', 0, 14);
┌─────────────┬───────┬──────────────────┬───────┬───────────────┬─────────────────┬──────────────┬─────────┐
│ snapshot_id │ rowid │   change_type    │  id   │     name      │      email      │ created_date │ status  │
│    int64    │ int64 │     varchar      │ int32 │    varchar    │     varchar     │     date     │ varchar │
├─────────────┼───────┼──────────────────┼───────┼───────────────┼─────────────────┼──────────────┼─────────┤
│           7 │     0 │ insert           │     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │ Active  │
│           7 │     1 │ insert           │     2 │ Bob Smith     │ bob@email.com   │ 2024-02-20   │ Active  │
│           7 │     2 │ insert           │     3 │ Carol Davis   │ carol@email.com │ 2024-03-10   │ Active  │
│          10 │     3 │ insert           │     4 │ David Wilson  │ david@email.com │ 2024-04-05   │ Premium │
│          12 │     0 │ update_postimage │     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │ VIP     │
│          13 │     5 │ insert           │     5 │ Eva Martinez  │ eva@email.com   │ 2024-05-01   │ Active  │
│          12 │     0 │ update_preimage  │     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │ Active  │
│          11 │     1 │ delete           │     2 │ Bob Smith     │ bob@email.com   │ 2024-02-20   │ Active  │
└─────────────┴───────┴──────────────────┴───────┴───────────────┴─────────────────┴──────────────┴─────────┘
```
# Exploring Storage Files
```sql
-- See the Parquet files in S3 (if using local storage, adjust path)
D FROM glob('s3://sspaeti/warehouse-data/*');
┌───────────────────────────────────────────────────────────────────────────────────────────┐
│                                            file                                            │
│                                           varchar                                          │
├───────────────────────────────────────────────────────────────────────────────────────────┤
│ s3://sspaeti/warehouse-data/ducklake-01971717-0b7b-7d25-a85e-1254fe2c1014.parquet          │
│ s3://sspaeti/warehouse-data/ducklake-01971717-78e1-7c42-9d20-1b6387d0476a.parquet          │
│ s3://sspaeti/warehouse-data/ducklake-01971718-128d-7faf-9415-c7e452f29887.parquet          │
│ s3://sspaeti/warehouse-data/ducklake-01971719-653a-7990-8375-f5b5b5732f69-delete.parquet   │
│ s3://sspaeti/warehouse-data/ducklake-0197171c-48be-7bfa-b39d-dc6013a6d3bc.parquet          │
│ s3://sspaeti/warehouse-data/ducklake-0197171c-4c6f-7c74-9970-4583571c429a-delete.parquet   │
│ s3://sspaeti/warehouse-data/ducklake-0197171c-8eba-7a40-8cf7-67fc115ace70.parquet          │
└───────────────────────────────────────────────────────────────────────────────────────────┘
```
```sql
-- Query the actual Parquet files directly
D FROM 's3://sspaeti/warehouse-data/*.parquet' LIMIT 5;
┌───────┬───────────────┬─────────────────┬──────────────┐
│  id   │     name      │      email      │ created_date │
│ int32 │    varchar    │     varchar     │     date     │
├───────┼───────────────┼─────────────────┼──────────────┤
│     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │
│     2 │ Bob Smith     │ bob@email.com   │ 2024-02-20   │
│     3 │ Carol Davis   │ carol@email.com │ 2024-03-10   │
│     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │
│     2 │ Bob Smith     │ bob@email.com   │ 2024-02-20   │
└───────┴───────────────┴─────────────────┴──────────────┘
```
# Multiple Table Operations
```sql
-- Create related table for ACID transaction demo
CREATE TABLE orders (
    order_id INTEGER,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    order_date DATE
);

-- Atomic multi-table transaction
BEGIN TRANSACTION;
INSERT INTO orders VALUES (1001, 1, 99.99, '2024-05-28');
UPDATE customers SET status = 'Premium' WHERE status = 'VIP';
COMMIT;
```
Result:
```sql
D from orders;
┌──────────┬─────────────┬───────────────┬────────────┐
│ order_id │ customer_id │    amount     │ order_date │
│  int32   │    int64    │ decimal(10,2) │    date    │
├──────────┼─────────────┼───────────────┼────────────┤
│     1001 │           1 │         99.99 │ 2024-05-28 │
└──────────┴─────────────┴───────────────┴────────────┘
D from customers;
┌───────┬───────────────┬─────────────────┬──────────────┬─────────┐
│  id   │     name      │      email      │ created_date │ status  │
│ int32 │    varchar    │     varchar     │     date     │ varchar │
├───────┼───────────────┼─────────────────┼──────────────┼─────────┤
│     3 │ Carol Davis   │ carol@email.com │ 2024-03-10   │ Active  │
│     4 │ David Wilson  │ david@email.com │ 2024-04-05   │ Premium │
│     5 │ Eva Martinez  │ eva@email.com   │ 2024-05-01   │ Active  │
│     1 │ Alice Johnson │ alice@email.com │ 2024-01-15   │ Premium │
└───────┴───────────────┴─────────────────┴──────────────┴─────────┘
```
# What Does the DuckLake Catalog Look Like?
It created these tables:
```sql
D select table_name, table_type from information_schema.tables where table_catalog = '__ducklake_metadata_warehouse';
┌───────────────────────────────────────┬────────────┐
│              table_name               │ table_type │
│                varchar                │  varchar   │
├───────────────────────────────────────┼────────────┤
│ ducklake_column                       │ BASE TABLE │
│ ducklake_column_tag                   │ BASE TABLE │
│ ducklake_data_file                    │ BASE TABLE │
│ ducklake_delete_file                  │ BASE TABLE │
│ ducklake_files_scheduled_for_deletion │ BASE TABLE │
│ ducklake_file_column_statistics       │ BASE TABLE │
│ ducklake_file_partition_value         │ BASE TABLE │
│ ducklake_inlined_data_tables          │ BASE TABLE │
│ ducklake_metadata                     │ BASE TABLE │
│ ducklake_partition_column             │ BASE TABLE │
│ ducklake_partition_info               │ BASE TABLE │
│ ducklake_schema                       │ BASE TABLE │
│ ducklake_snapshot                     │ BASE TABLE │
│ ducklake_snapshot_changes             │ BASE TABLE │
│ ducklake_table                        │ BASE TABLE │
│ ducklake_table_column_stats           │ BASE TABLE │
│ ducklake_table_stats                  │ BASE TABLE │
│ ducklake_tag                          │ BASE TABLE │
│ ducklake_view                         │ BASE TABLE │
├───────────────────────────────────────┴────────────┤
│ 19 rows                                   2 columns │
└─────────────────────────────────────────────────────┘
```
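Because the catalog is just a set of tables, you can inspect it with plain SQL. A sketch; the column names are taken from the DuckLake data model (see the next section), so double-check them against your version:

```sql
-- Last few snapshots, read straight from the metadata tables
SELECT snapshot_id, snapshot_time, schema_version
FROM __ducklake_metadata_warehouse.ducklake_snapshot
ORDER BY snapshot_id DESC
LIMIT 5;
```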
# Data Model of DuckLake
This is what the data model looks like:

Image from The DuckLake Manifesto: SQL as a Lakehouse Format – DuckLake
# Other DuckLake Implementations
Before the official DuckLake announcement by DuckDB Labs on 2025-05-27, there were other versions of DuckLake.
# Xebia Version
An integration of DuckDB with the open-source Unity Catalog.

Source
The article Ducklake: A journey to integrate DuckDB with Unity Catalog - Xebia walks us through the integration.
# Limitations as of Now
2024-10-22
There are a few limitations to be aware of, but these are evolving fast, and solutions are already on the horizon.
- Unity Catalog Authentication: At the time of initial development we used Unity Catalog 0.1.0. This version did not include Role-Based Access Control (RBAC) yet. Since then, support for this was added in Unity Catalog 0.2.0, making our setup even more powerful. In the upcoming series, we'll be able to demonstrate how to set it up, so stay tuned for that future post.
- Read-Only Delta Support: Currently, DuckDB can only read from Unity Catalog because of its dependency on delta-kernel-rs, which limits operations to read-only. The DuckDB team is actively working on write support, and it's definitely on their roadmap. In the meantime, we've developed a workaround to help you write to Unity Catalog, so you can keep moving forward even before native support arrives.
- MotherDuck Integration: MotherDuck offers a managed DuckDB experience, but unfortunately does not yet support the uc_catalog extension. That doesn't stop us! To get similar notebook integration, we have built a solution using Jupyter notebooks, a web-based tool for interactive computing.
Origin:
x.com
References: DuckHouse
Created 2024-10-22