
DuckDB

Last updated by Simon Späti

DuckDB is an in-process SQL OLAP database management system with strong SQL support. It borrows its shell implementation from SQLite, and each database is a single file on disk. Think of it as SQLite for analytical (OLAP) workloads, whereas SQLite targets transactional (OLTP) workloads. Despite being embedded, it can handle vast amounts of data locally; it’s the smaller, lighter counterpart to Apache Druid and other OLAP technologies.

It’s designed to work as an embedded library, eliminating the network latency you usually get when talking to a database.

WIP: This is an unfinished Note

Below are my uncleaned DuckDB notes. I’m sharing them as they are hopefully helpful to others. Once I have more time, I will convert them into a proper blog post or article. For now, it’s a bit messy.

# Why DuckDB

Abstract: Why DuckDB

DuckDB is an innovative in-process analytical database management system that combines simplicity, portability, and high performance. Designed as an embedded SQL OLAP database, DuckDB addresses the need for efficient data analysis on local machines without the complexity of traditional database setups.

Key features of DuckDB include:

  1. Simplicity: With no external dependencies and an embedded architecture, DuckDB eliminates the need for separate server processes or complex configurations.
  2. Portability: DuckDB runs on various operating systems and CPU architectures, from edge devices to powerful servers, and even in web browsers via WebAssembly.
  3. Performance: Utilizing a columnar-vectorized query execution engine, DuckDB achieves superior speed for analytical workloads compared to row-based systems.
  4. Feature-rich: Despite its lightweight nature, DuckDB offers advanced SQL support, ACID compliance, and integration with popular data science tools.
  5. Extensibility: A flexible extension mechanism allows for customization and additional functionality.
  6. Open-source: Released under the MIT License, DuckDB is freely available and open for community contributions.

DuckDB fills a unique niche in the database ecosystem, offering SQLite-like simplicity for analytical tasks. It’s particularly well-suited for data scientists and analysts working with local datasets, providing a fast, free, and user-friendly alternative to traditional data warehousing solutions for many use cases.

^f86df0

More on Why DuckDB.

# History & Organization

DuckDB is primarily the work of Mark Raasveldt and Hannes Mühleisen. It’s made up of a million lines of C++ and runs as a stand-alone binary. Initially, they tried to improve MonetDB, but its codebase was too old, so they started a new system out of CWI.

# First Released: 2019

Development is very active, with the commit count on its GitHub repo doubling nearly every year since it began in 2018. It was first released in 2019. DuckDB uses PostgreSQL’s SQL parser, Google’s RE2 regular expression engine and SQLite’s shell.

# DuckDB Labs Vs MotherDuck

DuckDB was started by Hannes Mühleisen and Mark Raasveldt at the Database Architectures Group at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, The Netherlands. They later founded DuckDB Labs, which focuses on commercial services around DuckDB. MotherDuck is the VC-funded, US-based company that ships distributed DuckDB services. More details in Announcing the MotherDuck Partnership.

DuckDB Labs, founded by the original creators of DuckDB, announced the partnership with MotherDuck on 2022-11-15; see Announcing the MotherDuck Partnership.

MotherDuck is the managed service around DuckDB that lets you scale from a local DB to a cloud or hybrid DB. It was founded by people behind Google BigQuery, such as Jordan Tigani. Also check his discussion on the Analytics Engineering Podcast about The Personal Data Warehouse, including an interesting segment on WebAssembly (WASM), e.g., an application compiled from C code, which is super fast. Figma uses that; it would otherwise never work in a browser.

# Release Calendar

# Getting Started

# Installation

On macOS, `brew install duckdb`; otherwise check DuckDB Installation – DuckDB.

  • Manage versions of DuckDB with duckdb-version-manager called Duckman: Manage your DuckDB CLI with ease
  • E.g. `duckman install 0.10.2`, `duckman default 0.10.2`, `duckman run nightly` or `duckman list remote`

# Demo App

Want to start with DuckLake and immediately have it up and running?

```sh
duckdb ducklake:https://github.com/marhar/frozen/raw/refs/heads/main/space.ducklake
```

and then `SHOW TABLES;`.

Source: Frozen DuckLakes for Multi-User, Serverless Data Access – DuckLake.

# Run in the Browser

Run DuckDB in your Website

# Features

# Use Cases

See DuckDB Use Cases

# Categories by DuckDB

The questions are usually:

  • Is DuckDB like Snowflake? Not really.
  • Is DuckDB like PostgreSQL? No, cousins maybe?
  • Is DuckDB like Pandas? It’s complicated.
  • Is DuckDB like SQLite? Yes and no!
  • Is DuckDB like Apache Spark? Interesting.

To better understand DuckDB and how it fits into use cases within large enterprises, let’s define the categories in which it can be used. The diagram below highlights DuckDB’s versatility.

```mermaid
graph TD
    DuckDB[DuckDB]
    DuckDB --> Interactive[Interactive Data Apps]
    DuckDB --> Extensible[Pipeline Compute Engine]
    DuckDB --> CostEffective[Lightweight SQL Analytics Solution]
    DuckDB --> Secure[Secure Enterprise Data Handler]
    DuckDB --> ZeroCopy[Zero-Copy SQL Connector]

    Interactive --> |Embeddable| I[Browser/App Integration]
    Extensible --> |On-Demand & High-Performance| E[SQL Workflows]
    CostEffective --> |Single-node| C[Compute Engine]
    Secure --> |Enhanced Security| S[Compliant Processing]
    ZeroCopy --> |Virtualized| Z[SQL Query Engine]

    classDef useCase fill:#FFEF03,stroke:#333,stroke-width:2px,color:#000
    class Interactive,Extensible,CostEffective,Secure,ZeroCopy useCase;
```

See more on The Enterprise Case for DuckDB: 5 Key Categories and Why Use It | ssp.sh. ^4d2559

A simpler categorization as of 2024-09-16, by Hannes Mühleisen:

Image from DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube

# Multi-categories by Alireza Sadeghi

As Alireza Sadeghi would explain it:

# Curated Use-Cases

See extensive use cases in DuckDB Use Cases.

# Geospatial

DuckDB Geospatial

# Technical Deep Dive

DuckDB is a columnar-vectorized query execution engine.

# Why Is It So Fast?

From Why DuckDB

DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) are processed in one operation. This greatly reduces overhead present in traditional systems such as PostgreSQL, MySQL or SQLite which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries. See more on Columnar vs. Row-based vs Vector-based

It ships as an amalgamation build - a single giant C++ file (SQLite is a single giant C file). It’s also backed by some strong computer science: it’s built by the academic researchers behind MonetDB and includes implementations of a bunch of interesting papers.

Culmination of Factors

It’s the culmination of lots of research in columnar systems, including vectorized compute, decoupled storage, file formats, query plans, join optimization, volcano vs. push-based execution, etc. Tweet / Twitter. See also Differential Storage.

Also check out Vectorized Query Engine.

# Comments from Developer Voices with Hannes Mühleisen

# Postgres Has Not Been Implemented but Helped Greatly with the Speed of DuckDB Too

  1. Smart Use of Modern Hardware: Recognizing that hardware tends to improve faster than data size grows, DuckDB is designed to make the most of the processing power available on typical modern computers. This means that for many practical use cases, DuckDB can handle the data processing needs without requiring distributed systems that add complexity and overhead.

# Philosophies and Guiding Principles of DuckDB

Personal philosophies and first principles they tried to solve from the very beginning, which were the guiding principles for Hannes Mühleisen to follow when building DuckDB:

  1. Simplicity and Usability: One of the core ideas behind DuckDB is to challenge the complexity traditionally associated with setting up and using data management systems. The team aimed to create a database that is straightforward to install and use, removing the barriers often encountered with tasks like configuration, user creation, and database initialization.
  2. Efficient Data Ingestion and Export: Recognizing that getting data into and out of a database is a fundamental operation that should be optimized, DuckDB focuses on making data import and export operations fast and easy. This focus on data movement efficiency directly addresses one of the initial interactions users have with the system, aiming to leave a positive impression and facilitate smooth data workflows.
  3. In-Process Model: DuckDB adopts an in-process model similar to SQLite, where the database runs within the same process as the application. This design choice significantly influences the system’s architecture, as it necessitates careful resource management and crash safety measures since the database shares its runtime environment with the host application.

All with academic rigor combined with practical application. The development of DuckDB is deeply rooted in academic research, leveraging advanced concepts in database design and query optimization. However, unlike some academic projects that remain theoretical, DuckDB has been built with a focus on practical application and ease of use. This ensures that it not only incorporates cutting-edge techniques for data processing but is also accessible and beneficial to a wide range of users.

# Comparing to Pandas

  1. Row-by-Row Processing: Pandas typically operates on a row-by-row basis for many operations, which can be less efficient than columnar processing for certain types of data manipulations and analyses. This approach can lead to increased memory usage and slower performance when working with large datasets.
  2. Materialization: Hannes discussed the concept of materialization, where intermediate results are stored in memory during the execution of a sequence of operations. In Pandas, operations might result in the creation of new data structures (like DataFrames) at each step, which can be resource-intensive and slow down computations.

DuckDB’s approach, by contrast, is to use vectorized query execution, where operations are performed on batches of data (chunks of rows) at a time, and it employs a columnar storage format. This method is more efficient for CPUs to process because it can take better advantage of modern hardware capabilities, such as cache utilization and vectorized instructions. The columnar approach also reduces the overhead associated with row-by-row processing and intermediate data materialization, leading to faster execution times, especially for analytical queries that operate on large volumes of data.
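To make the contrast concrete, here is a minimal sketch (assuming the `duckdb` and `pandas` packages are installed; the example data is made up): DuckDB can scan a pandas DataFrame in place via a replacement scan and run the query on vectors of values, instead of materializing a new DataFrame at each step.

```python
import duckdb
import pandas as pd

# an ordinary pandas DataFrame (hypothetical example data)
items = pd.DataFrame({
    "item": ["jeans", "hammer", "laptop"],
    "value": [20.0, 42.2, 2000.0],
})

# DuckDB scans the local DataFrame directly (no copy into the database);
# filtering and projection run on chunks of columns, not row by row
result = duckdb.sql("SELECT item, value FROM items WHERE value > 100").df()
print(result)
```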

# DuckDB Storage Format Compared to Parquet

DuckDB has its own storage format. It looks a bit like Parquet, but there are important differences. A DuckDB table is split into “row groups”, and a row group contains “column segments”. This layout is common for analytics storage formats; Parquet uses it too, though the terminology differs: Parquet says “column chunks” instead of “column segments”. The bigger differences are: (1) DuckDB does not use general-purpose compression (e.g. Snappy), and (2) DuckDB uses a fixed block size. (1) leads to a bigger data size but requires less computation for (de)compression; (2) enables in-place modifications, including updates and deletes. Mutability! DuckDB does use lightweight, data-specific compression algorithms, for example run-length encoding and dictionary encoding. Different column segments can use different algorithms; each segment is analyzed to determine which one is best, with data type and observed patterns (if any) as relevant factors. Mark Raasveldt (DuckDB creator) wrote a great blog about compression in DuckDB.
Jorrit Sandbrink on LinkedIn: #dataengineering #softwareengineering

And the referenced blog Lightweight Compression in DuckDB – DuckDB.

# DuckDB on top of Parquet vs. Native DuckDB

If you load the data into DuckDB’s native format, e.g., the NYC Taxi file, you can drastically reduce query time as well. But it’s already fast with zero-copy queries on top of Parquet.

On top of Parquet:

With native DuckDB:
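A minimal sketch of both approaches (file, table, and column names are hypothetical):

```sql
-- zero-copy: query the Parquet file directly
SELECT passenger_count, count(*)
FROM 'nyc_taxi.parquet'
GROUP BY passenger_count;

-- native: load once into DuckDB's own storage format, then query the table
CREATE TABLE taxi AS SELECT * FROM 'nyc_taxi.parquet';
SELECT passenger_count, count(*)
FROM taxi
GROUP BY passenger_count;
```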

# Adoption & Market

Isn’t the beauty and the hype coming (mostly) from DuckDB itself, which is free, extremely fast, and developer-friendly? I haven’t used the MotherDuck service yet, but DuckDB features like:

```sql
CALL load_aws_credentials();
SELECT * FROM "s3://some-bucket/that/requires/authentication.parquet";
```

or

```sql
-- read all files that match the glob pattern
SELECT * FROM 'test/*.parquet';
```

are impressive. I’ve used DuckDB here and there, and the speed and efficiency, e.g., as a zero-copy layer on top of my data lake (e.g., see Build a poor man’s data lake from scratch https://dagster.io/blog/duckdb-data-lake), are pretty cool to me. It’s also somewhat surprising how far you can get with a single-file database.

Google Trend:

duckdb - Explore - Google Trends

GitHub Stars - Hockey-stick

GitHub Star History

Niche, in comparison to SQLite:

duckdb, /m/01kgq9 - Explore - Google Trends

# Numbers from DuckCon

From DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube:

# Benchmarks

# Working with DuckDB

# Secret Management

# Listing Secrets

```sql
FROM duckdb_secrets();
```

# Creating Secrets

Create one based on ENV vars:

```sql
CREATE SECRET secret3 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'env;config'
);
```

# Persistent Secret

```sql
CREATE PERSISTENT SECRET my_persistent_secret (
    TYPE S3,
    KEY_ID 'my_secret_key',
    SECRET 'my_secret_value'
);
```

# Temporary Secret with Keys

```sql
CREATE SECRET my_secret (
    TYPE S3,
    KEY_ID 'my_secret_key',
    SECRET 'my_secret_value',
    REGION 'my_region'
);
```

See more on Secrets Manager – DuckDB

# AWS S3 SDK Credential Auto-detection

AWS S3 SDK Credential Auto-Detection Issue with Public Buckets

# Extensions

# DuckDB Delta Lake Native Integration

There’s also an integration with Unity Catalog, announced at the Databricks Data & AI Summit 2024.

# QuackStore

QuackStore: an OLAP cache layer for DuckDB.

# DuckDB Community Extensions

See DuckDB Community Extensions.

Check the report of maintainability and latest activities on DuckDB Extensions Analysis.

# MSSQL

Microsoft SQL Server just got an extension to bulk-load with DuckDB.

# Encryption

Available since 2025-11-20, see Data-at-Rest Encryption in DuckDB.
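Based on that announcement, encrypted database files are opened and created via `ATTACH`; a minimal sketch (the file name and key are placeholders):

```sql
-- create or open an encrypted DuckDB database file
ATTACH 'encrypted.duckdb' AS enc_db (ENCRYPTION_KEY 'my_secret_key');
```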

# Code Examples & Integrations

# Python API

DuckDB - Python API

# Fetching Data with SQL

```python
# fetch as pandas data frame
df = con.execute("SELECT * FROM items").fetchdf()
print(df)
#        item   value  count
# 0     jeans    20.0      1
# 1    hammer    42.2      2
# 2    laptop  2000.0      1
# 3  chainsaw   500.0     10
# 4    iphone   300.0      2

# fetch as dictionary of numpy arrays
arr = con.execute("SELECT * FROM items").fetchnumpy()
print(arr)
# {'item': masked_array(data=['jeans', 'hammer', 'laptop', 'chainsaw', 'iphone'],
#              mask=[False, False, False, False, False],
#        fill_value='?',
#             dtype=object), 'value': masked_array(data=[20.0, 42.2, 2000.0, 500.0, 300.0],
#              mask=[False, False, False, False, False],
#        fill_value=1e+20), 'count': masked_array(data=[1, 2, 1, 10, 2],
#              mask=[False, False, False, False, False],
#        fill_value=999999,
#             dtype=int32)}
```

# Create Table

```python
# create a table
con.execute("CREATE TABLE items(item VARCHAR, value DECIMAL(10,2), count INTEGER)")
# insert two items into the table
con.execute("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")

# retrieve the items again
con.execute("SELECT * FROM items")
print(con.fetchall())
# [('jeans', 20.0, 1), ('hammer', 42.2, 2)]
```

# Insert Data

```python
# insert a row using prepared statements
con.execute("INSERT INTO items VALUES (?, ?, ?)", ['laptop', 2000, 1])

# insert several rows using prepared statements
con.executemany("INSERT INTO items VALUES (?, ?, ?)", [['chainsaw', 500, 10], ['iphone', 300, 2]])

# query the database using a prepared statement
con.execute("SELECT item FROM items WHERE value > ?", [400])
print(con.fetchall())
# [('laptop',), ('chainsaw',)]
```

# Upgrade Existing Database (export/import)

Upgrade DuckDB Database (Export and Import)

# Database Connections

# Read from Another DuckDB File within DuckDB

DuckDB - Attach
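A minimal sketch of attaching a second DuckDB file and querying it (the file and table names are hypothetical):

```sql
-- attach another DuckDB database file under an alias
ATTACH 'other_analytics.duckdb' AS other;

-- query its tables via database.schema.table qualification
SELECT * FROM other.main.some_table;

DETACH other;
```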

# S3 Attach: Attach S3 Bucket and as Tables

```sql
ATTACH 'arn:aws:s3tables:us-east-1:111122223333:bucket/bucket_name'
    AS s3_tables_db (
        TYPE iceberg,
        ENDPOINT_TYPE s3_tables
    );
```

Preview: Amazon S3 Tables in DuckDB – DuckDB

And that’s all! Now, DuckDB is connected to Amazon S3 Tables. To show the available tables, run:

```sql
SHOW ALL TABLES;
```

```text
┌──────────────┬─────────┬──────────────┬──────────────┬──────────────┬───────────┐
│   database   │ schema  │     name     │ column_names │ column_types │ temporary │
│   varchar    │ varchar │   varchar    │  varchar[]   │  varchar[]   │  boolean  │
├──────────────┼─────────┼──────────────┼──────────────┼──────────────┼───────────┤
│ s3_tables_db │ ducks   │ duck_species │ [__]         │ [INTEGER]    │ false     │
└──────────────┴─────────┴──────────────┴──────────────┴──────────────┴───────────┘
```

You can query tables as if they were ordinary DuckDB tables:

```sql
FROM s3_tables_db.ducks.duck_species;
```

```text
┌───────┬──────────────┬────────────┐
│  id   │ english_name │ latin_name │
│ int32 │   varchar    │  varchar   │
├───────┼──────────────┼────────────┤
│     0 │ Anas nivis   │ Snow duck  │
└───────┴──────────────┴────────────┘
```

You also have an alternative option to connect to S3 Tables.

# Attach Postgres and Write to Postgres from DuckDB

```text
❯ duckdb susa_lzn2.duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D INSTALL postgres;
D LOAD postgres;

D ATTACH 'dbname=svsa user=postgres password=postgres host=host.docker.internal port=5444' AS pg_db (TYPE postgres);
D select count(*) from initial_load_lzn2_20240325121951.a02t045;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      3584412 │
└──────────────┘
D CREATE TABLE pg_db.a02t045 AS SELECT * FROM initial_load_lzn2_20240325121951.a02t045;
100% ▕████████████████████████████████████████████████████████████▏
```

See more on Fast Postgres to Postgres Initial Load Example with Connector-X and DuckDB by sspaeti · Pull Request #1354 · dlt-hub/dlt · GitHub in combination with dlt.

See also pg_duckdb.

# External Table With Dbt-duckdb

External Materializations can save data in Parquet format automatically (when using dbt with jwills/dbt-duckdb), which would allow Rill Developer or DuckDB to read those aggregated tables.
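A minimal dbt model sketch of an external materialization (assuming jwills/dbt-duckdb; the model name, location, and upstream ref are hypothetical):

```sql
-- models/aggregated_orders.sql
{{ config(materialized='external', location='output/aggregated_orders.parquet') }}

SELECT
    order_date,
    sum(amount) AS total_amount
FROM {{ ref('stg_orders') }}
GROUP BY order_date
```

With this config, dbt-duckdb writes the model's result to the given Parquet location, so tools like Rill Developer or plain DuckDB can read the aggregated table directly.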

See more on MotherDuck or the DuckDB course for beginners - YouTube, or awesome-duckdb: 🦆 A curated list of awesome DuckDB resources.

# Comparisons & Alternatives

With the acquisition of chDB, ClickHouse recently announced a feature called `local`. This is interesting, as DuckDB tries to become more like OLAP systems such as ClickHouse for large data, while ClickHouse tries to be more like DuckDB with this local approach.

Comparison to Postgres:

Developers are treating PostgreSQL almost like DuckDB—an integrated query engine with a pluggable storage layer. Except, unlike DuckDB, PostgreSQL’s storage layer is far more mature (DuckDB deliberately doesn’t even document their storage format). Extensions automatically inherit all of the transactional goodness that PostgreSQL provides. Databases Are Falling Apart: Database Disassembly and Its Implications


  • DuckDB vs Apache Arrow:
    • DuckDB: In-memory analytical database for querying large datasets
    • Arrow: Cross-language platform for standardized in-memory data representation
  • DuckDB vs ClickHouse:
    • DuckDB: Single-node, in-memory database for embedded analytics
    • ClickHouse: Distributed, columnar storage database for large-scale analytics
  • DuckDB vs SQLite:
    • DuckDB: Optimized for analytical queries on large datasets
    • SQLite: Lightweight, file-based database for general-purpose use
  • DuckDB vs PostgreSQL:
    • DuckDB: In-memory, analytical database focused on complex queries
    • PostgreSQL: Feature-rich, general-purpose relational database
  • DuckDB vs AWS services:
    • DuckDB: Embedded, in-memory analytical database
    • AWS: Offers various cloud-based services (e.g., Redshift, RDS, Athena) for different data processing needs

# Why DuckDB over Polars

Think about Polars like a faster version of pandas due to rust and efficient usage of all the cores on your machine. However, duckdb is a database with full-blown memory management https://duckdb.org/2024/07/09/memory-management.html where working out of core is ingrained in all the operators. Under heavy memory pressure, this tends to perform better / to have fewer crashes. In particular, having a buffer manager is key here. Georg Heiler on LinkedIn: GitHub - feldera/feldera: The Feldera Incremental Computation Engine by Georg Heiler
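DuckDB’s memory behavior referenced above is configurable; a minimal sketch of the relevant settings (values and path are placeholders):

```sql
-- cap how much memory DuckDB may use
SET memory_limit = '4GB';

-- directory where larger-than-memory operators spill to disk
SET temp_directory = '/tmp/duckdb_spill';
```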

# Constraints & Limitations

  • Currently no support for Avro
  • A single process can write with multiple threads, but multiple processes cannot write concurrently to a DuckDB file yet.
  • Restricted by the amount of RAM/CPU/disk space :)

# Writing Concurrency

Writing concurrency is a challenge for DuckDB: it isn’t built to handle concurrent writes automatically and focuses on one writer at a time.

An external component must handle the concurrency control. This is where Apache Arrow Flight (Protocol) can play a role. More on DuckHouse and the accompanying article by Julien Hurault and Hussain Sultan.

# AI & LLMs

# Scaling

The hosted solution is MotherDuck. An alternative is Ibis (see YouTube, origin).

# Tools & Resources

# Terminal Tool

GitHub - tconbeer/harlequin: The DuckDB IDE for Your Terminal.

# Other Resources

# Example Projects

DuckDB Use Cases

# Future of DuckDB

# 2026

# 2025

Next year (2025) outlook from DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube:

  • Extension Ecosystem
    • Extensions in different languages
  • Lakehouse Data Formats
  • Optimiser Improvements
    • Partition/Sorting Awareness
    • Cardinality Estimation
  • Extensible Parser

# Further Reads


Origin:
References: Philosophies and Guiding Principles behind DuckDB
Created 2022-03-01