
DuckDB

Last updated Jan 20, 2025

WIP: This is an unfinished Note

Below are my uncleaned DuckDB notes, shared in the hope that they are helpful to others. Once I have more time, I will convert them into a proper blog post or article. For now, it's a bit messy.

DuckDB is an in-process SQL OLAP database management system with strong SQL support. It borrows the SQLite shell implementation, and each database is a single file on disk. Think of it as SQLite for analytical (OLAP) workloads, whereas SQLite itself targets transactional workloads. Yet it can handle vast amounts of data locally. It's the smaller, lighter cousin of Apache Druid and other OLAP technologies.

It’s designed to work as an embedded library, eliminating the network latency you usually get when talking to a database.

Skip working with error-prone Excel or CSV files directly

With DuckDB, we no longer need to work with plain files (CSV, Excel, Parquet) directly. DuckDB adds a schema, types, and a SQL interface on top of them, and it is super fast.
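For example, a minimal sketch (the file name sales.csv and its columns are hypothetical): DuckDB queries the CSV in place and infers column names and types:

```sql
-- query a CSV directly; read_csv_auto infers the schema
SELECT category, SUM(amount) AS total
FROM read_csv_auto('sales.csv')
GROUP BY category
ORDER BY total DESC;
```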

# Why DuckDB

Abstract: Why DuckDB

DuckDB is an innovative in-process analytical database management system that combines simplicity, portability, and high performance. Designed as an embedded SQL OLAP database, DuckDB addresses the need for efficient data analysis on local machines without the complexity of traditional database setups.

Key features of DuckDB include:

  1. Simplicity: With no external dependencies and an embedded architecture, DuckDB eliminates the need for separate server processes or complex configurations.
  2. Portability: DuckDB runs on various operating systems and CPU architectures, from edge devices to powerful servers, and even in web browsers via WebAssembly.
  3. Performance: Utilizing a columnar-vectorized query execution engine, DuckDB achieves superior speed for analytical workloads compared to row-based systems.
  4. Feature-rich: Despite its lightweight nature, DuckDB offers advanced SQL support, ACID compliance, and integration with popular data science tools.
  5. Extensibility: A flexible extension mechanism allows for customization and additional functionality.
  6. Open-source: Released under the MIT License, DuckDB is freely available and open for community contributions.

DuckDB fills a unique niche in the database ecosystem, offering SQLite-like simplicity for analytical tasks. It’s particularly well-suited for data scientists and analysts working with local datasets, providing a fast, free, and user-friendly alternative to traditional data warehousing solutions for many use cases.

^f86df0

More on Why DuckDB.

# History

DuckDB is primarily the work of Mark Raasveldt and Hannes Mühleisen. It's made up of a million lines of C++ and runs as a stand-alone binary. Development is very active, with the commit count on its GitHub repo doubling nearly every year since the project began in 2018. DuckDB uses PostgreSQL's SQL parser, Google's RE2 regular expression engine, and SQLite's shell.

# DuckDB Labs vs MotherDuck

DuckDB was started by Hannes Mühleisen and Mark Raasveldt at the Database Architectures Group at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, The Netherlands. The two later founded DuckDB Labs, which focuses on commercial services around DuckDB. MotherDuck is a US-based, VC-funded company that ships distributed DuckDB services. More details in Announcing the MotherDuck Partnership.

DuckDB Labs, run by the original creators of DuckDB, announced the partnership with MotherDuck on 2022-11-15; see RW Announcing the MotherDuck Partnership.

MotherDuck is the managed service around DuckDB that lets you scale from a local DB to a cloud DB, or run in a hybrid of both. It was built by original Google BigQuery developers such as Jordan Tigani. Check also his discussion on the Analytics Engineering Podcast about The Personal Data Warehouse, including an interesting segment on WebAssembly (WASM): an application compiled to WASM (e.g. from C code) runs at near-native speed in the browser. Figma, for example, relies on this; it would otherwise never work as a browser app.

# Release Calendar

# Installation

On macOS: `brew install duckdb`; otherwise check DuckDB Installation – DuckDB.
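A quick sanity check after installing (a minimal sketch):

```sql
-- start the CLI with `duckdb` (in-memory) or `duckdb mydb.duckdb` (a file), then:
PRAGMA version;
```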

# Use Cases

DuckDB is categorized as an analytical and embedded database.

|             | Transactional           | Analytical                                                                       |
| ----------- | ----------------------- | -------------------------------------------------------------------------------- |
| Embedded    | SQLite                  | DuckDB                                                                           |
| Stand-alone | Oracle, Postgres, MySQL | Cloud DWHs: BigQuery, Snowflake, Redshift; Modern OLAP: Druid, ClickHouse, Pinot |

Table inspired by Oliver Molander in Better Programming

# Categories by DuckDB

To better understand DuckDB and how it fits into use cases within large enterprises, let's define the categories in which it can be used. The diagram below highlights DuckDB's versatility.

```mermaid
graph TD
    DuckDB[DuckDB]
    DuckDB --> Interactive[Interactive Data Apps]
    DuckDB --> Extensible[Pipeline Compute Engine]
    DuckDB --> CostEffective[Lightweight SQL Analytics Solution]
    DuckDB --> Secure[Secure Enterprise Data Handler]
    DuckDB --> ZeroCopy[Zero-Copy SQL Connector]

    Interactive --> |Embeddable| I[Browser/App Integration]
    Extensible --> |On-Demand & High-Performance| E[SQL Workflows]
    CostEffective --> |Single-node| C[Compute Engine]
    Secure --> |Enhanced Security| S[Compliant Processing]
    ZeroCopy --> |Virtualized| Z[SQL Query Engine]

    classDef useCase fill:#FFEF03,stroke:#333,stroke-width:2px,color:#000
    class Interactive,Extensible,CostEffective,Secure,ZeroCopy useCase;
```

See more on The Enterprise Case for DuckDB: 5 Key Categories and Why Use It | ssp.sh. ^4d2559

A simpler categorization, as of 2024-09-16, by Hannes Mühleisen:

Image from DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube

# Multi-categories by Alireza Sadeghi

As Alireza Sadeghi would explain it:

# Curated Use-Cases

See extensive use cases in DuckDB Use Cases.

# Benchmarks

# Example Projects

DuckDB Use Cases

Isn't the beauty and the hype coming (mostly) from DuckDB itself? It's free, extremely fast, and developer-friendly. I haven't used the MotherDuck service yet, but DuckDB features like

```sql
CALL load_aws_credentials();
SELECT * FROM "s3://some-bucket/that/requires/authentication.parquet";
```

or

```sql
-- read all files that match the glob pattern
SELECT * FROM 'test/*.parquet';
```

speak for themselves. I have used DuckDB here and there, and the speed and efficiency, e.g. as a zero-store layer on top of my data lake (see Build a poor man's data lake from scratch, https://dagster.io/blog/duckdb-data-lake), are pretty cool to me. It is also somewhat surprising how far you get with a single-file database.

# Constraints / Limitations

Google Trends:

duckdb - Explore - Google Trends

GitHub Stars - Hockey-stick

GitHub Star History

Niche, in comparison to SQLite:

duckdb, /m/01kgq9 - Explore - Google Trends

# Numbers

From DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube:

# LLMs and Models for DuckDB

# Scale Up / Distribution

The hosted solution is MotherDuck. An alternative might be scaling with Ibis in Python (see YouTube, origin).

# Technical

DuckDB contains a columnar-vectorized query execution engine.

# Why is it so fast?

From Why DuckDB

DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a "vector") is processed in one operation. This greatly reduces the overhead present in traditional systems such as PostgreSQL, MySQL, or SQLite, which process each row sequentially. Vectorized query execution leads to far better performance for OLAP queries. See more on Columnar vs. Row-based vs. Vector-based.
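You can look at the vectorized plan yourself with EXPLAIN. A minimal sketch (the lineitem table is a hypothetical TPC-H-style example):

```sql
-- EXPLAIN prints the physical plan DuckDB will run; each operator
-- processes a vector (a batch of ~2048 values) per call, not one row
EXPLAIN
SELECT l_returnflag, SUM(l_extendedprice) AS revenue
FROM lineitem
GROUP BY l_returnflag;
```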

It ships as an amalgamation build - a single giant C++ file (SQLite is a single giant C file). It's also backed by some strong computer science: it was built by the academic researchers behind MonetDB and includes implementations of a bunch of interesting papers.

Culmination of Factors

It’s the culmination of lots of research in columnar systems. Including vectorized compute, decoupled storage, file formats, query plans, join optimization, volcano vs push, etc. Tweet / Twitter. See also Differential Storage

Also check out Vectorized Query Engine.

# Comments from Developer Voices with Hannes Mühleisen

  1. Postgres was not reimplemented, but it helped greatly with the speed of DuckDB too.
  2. Smart Use of Modern Hardware: Recognizing that hardware tends to improve faster than data size grows, DuckDB is designed to make the most of the processing power available on typical modern computers. This means that for many practical use cases, DuckDB can handle the data processing needs without requiring distributed systems that add complexity and overhead.
# Philosophies and Guiding Principles of DuckDB

Personal philosophies and first principles they set out to follow from the very beginning; these were the guiding principles for Hannes Mühleisen when building DuckDB:

  1. Simplicity and Usability: One of the core ideas behind DuckDB is to challenge the complexity traditionally associated with setting up and using data management systems. The team aimed to create a database that is straightforward to install and use, removing the barriers often encountered with tasks like configuration, user creation, and database initialization.
  2. Efficient Data Ingestion and Export: Recognizing that getting data into and out of a database is a fundamental operation that should be optimized, DuckDB focuses on making data import and export operations fast and easy. This focus on data movement efficiency directly addresses one of the initial interactions users have with the system, aiming to leave a positive impression and facilitate smooth data workflows.
  3. In-Process Model: DuckDB adopts an in-process model similar to SQLite, where the database runs within the same process as the application. This design choice significantly influences the system’s architecture, as it necessitates careful resource management and crash safety measures since the database shares its runtime environment with the host application.

All of this is done with academic rigor combined with practical application. The development of DuckDB is deeply rooted in academic research, leveraging advanced concepts in database design and query optimization. However, unlike some academic projects that remain theoretical, DuckDB has been built with a focus on practical application and ease of use. This ensures that it not only incorporates cutting-edge techniques for data processing but is also accessible and beneficial to a wide range of users.

# Comparing to Pandas
  1. Row-by-Row Processing: Pandas typically operates on a row-by-row basis for many operations, which can be less efficient than columnar processing for certain types of data manipulations and analyses. This approach can lead to increased memory usage and slower performance when working with large datasets.
  2. Materialization: Hannes discussed the concept of materialization, where intermediate results are stored in memory during the execution of a sequence of operations. In Pandas, operations might result in the creation of new data structures (like DataFrames) at each step, which can be resource-intensive and slow down computations.

DuckDB’s approach, by contrast, is to use vectorized query execution, where operations are performed on batches of data (chunks of rows) at a time, and it employs a columnar storage format. This method is more efficient for CPUs to process because it can take better advantage of modern hardware capabilities, such as cache utilization and vectorized instructions. The columnar approach also reduces the overhead associated with row-by-row processing and intermediate data materialization, leading to faster execution times, especially for analytical queries that operate on large volumes of data.
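To make the materialization point concrete, here is a minimal sketch (the file events.parquet and its columns are hypothetical): the whole query runs as one pipelined, vectorized plan, so no intermediate DataFrame is built between the filter, aggregate, and sort steps.

```sql
-- one pipelined plan: scan -> filter -> aggregate -> sort,
-- executed on vectors with no per-step materialization
SELECT user_id, COUNT(*) AS events
FROM 'events.parquet'
WHERE ts >= DATE '2024-01-01'
GROUP BY user_id
ORDER BY events DESC
LIMIT 10;
```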

# DuckDB on top of Parquet vs. Native DuckDB

If you load the data into DuckDB's native format, e.g. the NYC Taxi dataset, you can drastically reduce query time as well. But it's already fast with zero-copy reads on top of Parquet.

On top of Parquet:

With native DuckDB:
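Both modes in one minimal sketch (the file name yellow_tripdata.parquet is hypothetical):

```sql
-- zero-copy: query the Parquet file in place
SELECT count(*) FROM 'yellow_tripdata.parquet';

-- native: load once into DuckDB's own storage format, then query the table
CREATE TABLE trips AS SELECT * FROM 'yellow_tripdata.parquet';
SELECT count(*) FROM trips;
```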

# DuckDB Storage Format compared to Parquet

DuckDB has its own storage format. It looks a bit like Parquet, but there are important differences. A DuckDB table is split into "row groups", and a row group contains "column segments". This is common for analytics storage formats; Parquet also uses it, though the terminology differs: Parquet says "column chunks" instead of "column segments".

But there are bigger differences: (1) DuckDB does not use general-purpose compression (e.g. Snappy), and (2) DuckDB uses a fixed block size. (1) leads to a bigger data size but requires less computation for (de)compression. (2) enables in-place modifications, including updates and deletes. Mutability!

DuckDB does use lightweight, data-specific compression algorithms, for example run-length encoding and dictionary encoding. Different column segments can use different algorithms: each segment is analyzed to determine which one is best, with data type and observed patterns (if any) as relevant factors. Mark Raasveldt (DuckDB creator) wrote a great blog about compression in DuckDB.

Jorrit Sandbrink on LinkedIn: #dataengineering #softwareengineering

And the referenced blog Lightweight Compression in DuckDB – DuckDB.
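You can inspect this layout on your own tables; a minimal sketch (the table name trips is a hypothetical example):

```sql
-- storage_info lists row groups, column segments, and the
-- compression algorithm DuckDB chose for each segment
PRAGMA storage_info('trips');
```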

# Extensions

# DuckDB Delta Lake native integration

Integration with Unity Catalog was also announced at the Databricks Data & AI Summit 2024.

# DuckDB Community Extensions

See DuckDB Community Extensions.

# Comparison and Alternatives

ClickHouse recently announced, with the acquisition of chDB, a feature called local. This is interesting, as DuckDB tries to become more like OLAP systems such as ClickHouse for large data, while ClickHouse tries to become more like DuckDB with this local approach.

Comparison to Postgres:

Developers are treating PostgreSQL almost like DuckDB—an integrated query engine with a pluggable storage layer. Except, unlike DuckDB, PostgreSQL’s storage layer is far more mature (DuckDB deliberately doesn’t even document their storage format). Extensions automatically inherit all of the transactional goodness that PostgreSQL provides. Databases Are Falling Apart: Database Disassembly and Its Implications


# Why DuckDB over Polars

Think about Polars as a faster version of pandas, due to Rust and efficient usage of all the cores on your machine. However, DuckDB is a database with full-blown memory management (https://duckdb.org/2024/07/09/memory-management.html), where working out-of-core is ingrained in all the operators. Under heavy memory pressure, this tends to perform better / to have fewer crashes. In particular, having a buffer manager is key here. Georg Heiler on LinkedIn: GitHub - feldera/feldera: The Feldera Incremental Computation Engine by Georg Heiler
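Related, a small sketch of that memory management from the SQL side (both settings are documented DuckDB options; the values are arbitrary examples):

```sql
-- cap DuckDB's memory usage; larger-than-memory operators
-- spill intermediate data to the temp directory instead of crashing
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';
```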

# Terminal Tool to query

GitHub - tconbeer/harlequin: The DuckDB IDE for Your Terminal.

# Code Examples

# Python API and Handling Data

DuckDB - Python API

# Fetching Data with SQL

```python
import duckdb

# connect to an in-memory database; the items table is created in the
# "Create Table" section below
con = duckdb.connect()

# fetch as pandas data frame
df = con.execute("SELECT * FROM items").fetchdf()
print(df)
#        item   value  count
# 0     jeans    20.0      1
# 1    hammer    42.2      2
# 2    laptop  2000.0      1
# 3  chainsaw   500.0     10
# 4    iphone   300.0      2

# fetch as dictionary of numpy arrays
arr = con.execute("SELECT * FROM items").fetchnumpy()
print(arr)
# {'item': masked_array(data=['jeans', 'hammer', 'laptop', 'chainsaw', 'iphone'],
#              mask=[False, False, False, False, False],
#        fill_value='?',
#             dtype=object), 'value': masked_array(data=[20.0, 42.2, 2000.0, 500.0, 300.0],
#              mask=[False, False, False, False, False],
#        fill_value=1e+20), 'count': masked_array(data=[1, 2, 1, 10, 2],
#              mask=[False, False, False, False, False],
#        fill_value=999999,
#             dtype=int32)}
```

# Create Table

```python
# create a table
con.execute("CREATE TABLE items(item VARCHAR, value DECIMAL(10,2), count INTEGER)")
# insert two items into the table
con.execute("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")

# retrieve the items again
con.execute("SELECT * FROM items")
print(con.fetchall())
# [('jeans', 20.0, 1), ('hammer', 42.2, 2)]
```

# Insert Data

```python
# insert a row using prepared statements
con.execute("INSERT INTO items VALUES (?, ?, ?)", ['laptop', 2000, 1])

# insert several rows using prepared statements
con.executemany("INSERT INTO items VALUES (?, ?, ?)", [['chainsaw', 500, 10], ['iphone', 300, 2]])

# query the database using a prepared statement
con.execute("SELECT item FROM items WHERE value > ?", [400])
print(con.fetchall())
# [('laptop',), ('chainsaw',)]
```

# Upgrade existing Database (export/import)

Upgrade DuckDB Database (Export and Import)
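The pattern itself is plain SQL; a minimal sketch (the directory name duckdb_export is a hypothetical example):

```sql
-- with the old DuckDB version: dump schema and data to a directory
EXPORT DATABASE 'duckdb_export';

-- with the new DuckDB version: load everything back in
IMPORT DATABASE 'duckdb_export';
```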

# Read from another DuckDB file within DuckDB

DuckDB - Attach
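A minimal sketch (file, schema, and table names are hypothetical):

```sql
-- attach a second database file read-only and query it with a qualified name
ATTACH 'other.duckdb' AS other (READ_ONLY);
SELECT * FROM other.main.some_table;
```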

# External Table with dbt-duckdb

External materializations can save data in Parquet format automatically (when using dbt with jwills/dbt-duckdb), which allows Rill Developer or DuckDB to read those aggregated tables.
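Plain DuckDB can do the same thing with a COPY statement; a minimal sketch of the idea (table, columns, and file name are hypothetical):

```sql
-- write an aggregated result out as Parquet for other tools to pick up
COPY (SELECT item, SUM(value) AS total FROM items GROUP BY item)
TO 'items_agg.parquet' (FORMAT PARQUET);
```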

See more on MotherDuck or the DuckDB course for beginners - YouTube, or awesome-duckdb: 🦆 A curated list of awesome DuckDB resources.

# Attach Postgres and write to Postgres from DuckDB

```text
❯ duckdb susa_lzn2.duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D INSTALL postgres;
D LOAD postgres;

D ATTACH 'dbname=svsa user=postgres password=postgres host=host.docker.internal port=5444' AS pg_db (TYPE postgres);
D select count(*) from initial_load_lzn2_20240325121951.a02t045;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      3584412 │
└──────────────┘
D CREATE TABLE pg_db.a02t045 AS SELECT * FROM initial_load_lzn2_20240325121951.a02t045;
100% ▕████████████████████████████████████████████████████████████▏
```

See more on Fast Postgres to Postgres Initial Load Example with Connector-X and DuckDB by sspaeti · Pull Request #1354 · dlt-hub/dlt · GitHub in combination with dlt.

See also pg_duckdb.

# Secret handling

# Listing Secrets

```sql
FROM duckdb_secrets();
```

# Creating Secrets

Create one based on ENV vars:

```sql
CREATE SECRET secret3 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'env;config'
);
```

# Persistent secret

```sql
CREATE PERSISTENT SECRET my_persistent_secret (
    TYPE S3,
    KEY_ID 'my_secret_key',
    SECRET 'my_secret_value'
);
```

# Temporary secret with keys

```sql
CREATE SECRET my_secret (
    TYPE S3,
    KEY_ID 'my_secret_key',
    SECRET 'my_secret_value',
    REGION 'my_region'
);
```

See more on Secrets Manager – DuckDB

# Future of DuckDB

Next year (2025) outlook from DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube:

# Other resources


Origin:
References:
Created 2022-03-01