🧠 Second Brain


DuckDB

Last updated Jun 18, 2024

WIP: This is an unfinished note

Below are my uncleaned DuckDB notes; I’m sharing them in the hope they are helpful to others. Once I have more time, I will turn them into a proper blog post or article. For now, it’s a bit messy.

DuckDB is an in-process SQL OLAP database management system with strong SQL support. It borrows the SQLite shell implementation, and each database is a single file on disk. Think of it as SQLite for analytical (OLAP) workloads, whereas SQLite targets transactional relational workloads. Yet it can handle vast amounts of data locally. It’s the smaller, lighter counterpart to Apache Druid and other OLAP technologies.

It’s designed to work as an embedded library, eliminating the network latency you usually get when talking to a database.

Skip working with error-prone Excels or CSVs directly

With DuckDB, we no longer need to work with plain files (CSV, Excel, Parquet) directly: DuckDB adds a schema, types, and a SQL interface on top of them, and is super fast.
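For example, DuckDB can query a CSV file directly, inferring column names and types. A minimal sketch (the file `items.csv` and its columns are hypothetical):

```sql
-- query a CSV file directly; schema and types are inferred
SELECT item, SUM(value) AS total
FROM read_csv_auto('items.csv')
GROUP BY item;
```

No import step, no table definition up front; the CSV behaves like a typed table.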

# History

DuckDB is primarily the work of Mark Raasveldt and Hannes Mühleisen. It’s made up of a million lines of C++ and runs as a stand-alone binary. Development is very active with the commit count on its GitHub repo doubling nearly every year since it began in 2018. DuckDB uses PostgreSQL’s SQL parser, Google’s RE2 regular expression engine and SQLite’s shell.

# Installation

On macOS, brew install duckdb; otherwise see DuckDB Installation – DuckDB.

# Use-Cases

DuckDB Labs, founded by the creators of DuckDB, announced a partnership with MotherDuck on 2022-11-15; see DuckDB Labs - Announcing the MotherDuck Partnership.

MotherDuck is the managed service around DuckDB that lets you scale from a local DB to a cloud DB, or a hybrid of both. It was started by Google BigQuery creators and developers such as Jordan Tigani. Check also his discussion on the Analytics Engineering Podcast about The Personal Data Warehouse, including an interesting discussion around WebAssembly (WASM): an application compiled down to run in the browser at near-native speed. Figma, for example, uses WASM for workloads that would otherwise never work in a browser.

# Join datasets from multiple sources

Want to join a CSV file, a Postgres table and a folder of parquet files on an S3 bucket? All from the comfort of your DBeaver session, with no server to set up, and have it Just Work™?

```sql
ATTACH 'dbname=mydb' AS pgsql (TYPE POSTGRES);
SELECT ...
FROM 'my.csv' c
INNER JOIN pgsql.my_table p USING (k)
INNER JOIN parquet_scan('s3://mybucket/mypqfolder/*.parquet') pq USING (j)
```

# Read 10k JSONs

Have a directory of 10000 JSON files you want to quickly aggregate some info from? You could sift through the jq documentation and attempt to conjure some series of piped commands, or you could just write:

```sql
SELECT myobj.myfld, count(*) FROM "**/log.json" GROUP BY 1;
```

Want all of these results in a pandas dataframe? Also a breeze:

```python
duckdb.sql("SELECT ...").to_df()
```

And what’s most impressive about all of this is the performance. It consistently surprises me how fast query responses are, and how gracefully it handles data much larger than available memory. LinkedIn Source

# Benchmarks

# Example Projects

# DuckDB Labs vs MotherDuck


DuckDB Labs was started by Hannes Mühleisen (creator of DuckDB), who is at CWI (Centrum Wiskunde & Informatica) in Amsterdam, and focuses on DuckDB itself; MotherDuck is the company that’s going to ship it to the cloud :)

More details on DuckDB Labs & MotherDuck in DuckDB Labs - Announcing the MotherDuck Partnership.

# Hype

Isn’t the beauty and the hype coming (mostly) from DuckDB itself? It’s free, extremely fast, and developer-friendly. I haven’t used the MotherDuck service yet, but DuckDB features like

```sql
CALL load_aws_credentials();
SELECT * FROM "s3://some-bucket/that/requires/authentication.parquet";
```

or

```sql
-- read all files that match the glob pattern
SELECT * FROM 'test/*.parquet';
```

are great on their own. I’ve used DuckDB here and there, and the speed and efficiency, e.g. as a zero-store layer on top of my data lake (see Build a poor man’s data lake from scratch, https://dagster.io/blog/duckdb-data-lake), are pretty cool to me. It’s also somewhat surprising how far you can get with a single-file database.

# Scale Up / Distribution

The hosted solution is MotherDuck. A possible alternative is Ibis in Python (see YouTube, origin).

# Technical

# Why is it so fast?

From Why DuckDB:

DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) is processed in one operation. This greatly reduces overhead present in traditional systems such as PostgreSQL, MySQL or SQLite, which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries. See more on Columnar vs. Row-based vs. Vector-based.

It ships as an amalgamation build: a single giant C++ file (SQLite is a single giant C file). It’s also backed by some strong computer science: it comes from the academic researchers behind MonetDB and includes implementations of a bunch of interesting papers:

It’s the culmination of lots of research in columnar systems. Including vectorized compute, decoupled storage, file formats, query plans, join optimization, volcano vs push, etc. Tweet / Twitter

# Comments from Developer Voices with Hannes Mühleisen

Postgres was not reimplemented, but studying it helped greatly with the speed of DuckDB too.

  1. Smart Use of Modern Hardware: Recognizing that hardware tends to improve faster than data size grows, DuckDB is designed to make the most of the processing power available on a typical modern computer. This means that for many practical use cases, DuckDB can handle the data processing needs without distributed systems that add complexity and overhead.
# Philosophies and Guiding Principles of DuckDB

Personal philosophies and first principles they tried to follow from the very beginning, which served as Hannes Mühleisen’s guiding principles when building DuckDB:

  1. Simplicity and Usability: One of the core ideas behind DuckDB is to challenge the complexity traditionally associated with setting up and using data management systems. The team aimed to create a database that is straightforward to install and use, removing the barriers often encountered with tasks like configuration, user creation, and database initialization.
  2. Efficient Data Ingestion and Export: Recognizing that getting data into and out of a database is a fundamental operation that should be optimized, DuckDB focuses on making data import and export operations fast and easy. This focus on data movement efficiency directly addresses one of the initial interactions users have with the system, aiming to leave a positive impression and facilitate smooth data workflows.
  3. In-Process Model: DuckDB adopts an in-process model similar to SQLite, where the database runs within the same process as the application. This design choice significantly influences the system’s architecture, as it necessitates careful resource management and crash safety measures since the database shares its runtime environment with the host application.

All with academic rigor combined with practical application. The development of DuckDB is deeply rooted in academic research, leveraging advanced concepts in database design and query optimization. However, unlike some academic projects that remain theoretical, DuckDB has been built with a focus on practical application and ease of use. This ensures that it not only incorporates cutting-edge techniques for data processing but is also accessible and beneficial to a wide range of users.

# Comparing to Pandas
  1. Row-by-Row Processing: Pandas typically operates on a row-by-row basis for many operations, which can be less efficient than columnar processing for certain types of data manipulations and analyses. This approach can lead to increased memory usage and slower performance when working with large datasets.
  2. Materialization: Hannes discussed the concept of materialization, where intermediate results are stored in memory during the execution of a sequence of operations. In Pandas, operations might result in the creation of new data structures (like DataFrames) at each step, which can be resource-intensive and slow down computations.

DuckDB’s approach, by contrast, is to use vectorized query execution, where operations are performed on batches of data (chunks of rows) at a time, and it employs a columnar storage format. This method is more efficient for CPUs to process because it can take better advantage of modern hardware capabilities, such as cache utilization and vectorized instructions. The columnar approach also reduces the overhead associated with row-by-row processing and intermediate data materialization, leading to faster execution times, especially for analytical queries that operate on large volumes of data.

# DuckDB on top of Parquet vs. Native DuckDB

If you load the data into DuckDB’s native format, e.g. the NYC Taxi file, you can drastically reduce query time as well. But it’s already fast with zero-copy reads on top of Parquet.
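A sketch of the two approaches (the file `taxi.parquet` and the `passenger_count` column are hypothetical stand-ins for the NYC Taxi data):

```sql
-- zero-copy: query the Parquet file directly
SELECT passenger_count, COUNT(*) FROM 'taxi.parquet' GROUP BY 1;

-- native: load once into DuckDB's own storage format, then query repeatedly
CREATE TABLE taxi AS SELECT * FROM 'taxi.parquet';
SELECT passenger_count, COUNT(*) FROM taxi GROUP BY 1;
```

The one-time CREATE TABLE pays off when you query the same data many times.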

(Benchmark screenshots, not included here: one on top of Parquet, one with native DuckDB.)

# DuckDB Storage Format compared to Parquet

DuckDB has its own storage format. It looks a bit like Parquet, but there are important differences. A DuckDB table is split into “row groups”, and a row group contains “column segments”. This is common for analytics storage formats; Parquet also uses it, though the terminology differs: Parquet says “column chunks” instead of “column segments”. But there are bigger differences:

(1) DuckDB does not use general-purpose compression (e.g. Snappy).
(2) DuckDB uses a fixed block size.

(1) leads to a bigger data size, but requires less computation for (de)compression. (2) enables in-place modifications, including updates and deletes. Mutability!

DuckDB does use lightweight data-specific compression algorithms, for example run-length encoding and dictionary encoding, which I’ve written about before. Different column segments can use different algorithms; each segment is analyzed to determine which one is best, with data type and observed patterns (if any) as relevant factors. Mark Raasveldt (DuckDB creator) wrote a great blog about compression in DuckDB.
Jorrit Sandbrink on LinkedIn: #dataengineering #softwareengineering

And the referenced blog Lightweight Compression in DuckDB – DuckDB.

# Why DuckDB for a demo project:

See an example: Open-Source Data Engineering Projects

# Extensions

# DuckDB Delta Lake native integration

There is also an integration with Unity Catalog, announced at the Databricks Data & AI Summit 2024.
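A minimal sketch of reading a Delta table via the delta extension (the bucket path is hypothetical):

```sql
INSTALL delta;
LOAD delta;
SELECT * FROM delta_scan('s3://my-bucket/my-delta-table/');
```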

# DuckDB Community Extensions

Hannes also announced community extensions at the Databricks Data & AI Summit 2024:

GitHub Repo: duckdb/community-extensions
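Community extensions install with the same one-liner as core extensions, just with a FROM community suffix; for example, with the h3 community extension:

```sql
INSTALL h3 FROM community;
LOAD h3;
```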

# Comparison

ClickHouse recently announced a feature called clickhouse-local, which is interesting: DuckDB tries to become more like OLAP systems such as ClickHouse for large data, while ClickHouse tries to become more like DuckDB with this local approach.

Comparison to Postgres:

Developers are treating PostgreSQL almost like DuckDB—an integrated query engine with a pluggable storage layer. Except, unlike DuckDB, PostgreSQL’s storage layer is far more mature (DuckDB deliberately doesn’t even document their storage format). Extensions automatically inherit all of the transactional goodness that PostgreSQL provides. Databases Are Falling Apart: Database Disassembly and Its Implications

# Terminal Tool to query

GitHub - tconbeer/harlequin: The DuckDB IDE for Your Terminal.

# Python API and Handling Data

DuckDB - Python API

# Fetching Data with SQL

```python
# fetch as pandas data frame
df = con.execute("SELECT * FROM items").fetchdf()
print(df)
#        item   value  count
# 0     jeans    20.0      1
# 1    hammer    42.2      2
# 2    laptop  2000.0      1
# 3  chainsaw   500.0     10
# 4    iphone   300.0      2

# fetch as dictionary of numpy arrays
arr = con.execute("SELECT * FROM items").fetchnumpy()
print(arr)
# {'item': masked_array(data=['jeans', 'hammer', 'laptop', 'chainsaw', 'iphone'],
#              mask=[False, False, False, False, False],
#        fill_value='?',
#             dtype=object), 'value': masked_array(data=[20.0, 42.2, 2000.0, 500.0, 300.0],
#              mask=[False, False, False, False, False],
#        fill_value=1e+20), 'count': masked_array(data=[1, 2, 1, 10, 2],
#              mask=[False, False, False, False, False],
#        fill_value=999999,
#             dtype=int32)}
```

# Create Table

```python
# create a table
con.execute("CREATE TABLE items(item VARCHAR, value DECIMAL(10,2), count INTEGER)")
# insert two items into the table
con.execute("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")

# retrieve the items again
con.execute("SELECT * FROM items")
print(con.fetchall())
# [('jeans', 20.0, 1), ('hammer', 42.2, 2)]
```

# Insert Data

```python
# insert a row using prepared statements
con.execute("INSERT INTO items VALUES (?, ?, ?)", ['laptop', 2000, 1])

# insert several rows using prepared statements
con.executemany("INSERT INTO items VALUES (?, ?, ?)", [['chainsaw', 500, 10], ['iphone', 300, 2]])

# query the database using a prepared statement
con.execute("SELECT item FROM items WHERE value > ?", [400])
print(con.fetchall())
# [('laptop',), ('chainsaw',)]
```

# Upgrade existing Database (export/import)

Upgrade DuckDB Database (Export and Import)
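The export/import round-trip looks like this sketch ('export_dir' is a hypothetical directory):

```sql
-- with the old DuckDB version: dump schema and data to a directory
EXPORT DATABASE 'export_dir';

-- then, with the new DuckDB version, against a fresh database file:
IMPORT DATABASE 'export_dir';
```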

# Read from another duckdb file within DuckDB:

DuckDB - Attach
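Attaching a second DuckDB file is a one-liner (file and table names are hypothetical):

```sql
ATTACH 'other.duckdb' AS other;
SELECT count(*) FROM other.some_table;
DETACH other;
```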

# External Table with dbt-duckdb

External materializations can save data in Parquet format automatically (when using dbt with jwills/dbt-duckdb), which allows Rill Developer or DuckDB to read those aggregated tables.
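A dbt-duckdb model materialized externally might look like this sketch (the model name, location, and the referenced `items` model are hypothetical):

```sql
-- models/items_agg.sql
{{ config(materialized='external', location='output/items_agg.parquet') }}
SELECT item, SUM(value) AS total
FROM {{ ref('items') }}
GROUP BY item
```

dbt then writes the query result to the Parquet file at `location` instead of a DuckDB table.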

See more on MotherDuck or the DuckDB course for beginners - YouTube, or awesome-duckdb: 🦆 A curated list of awesome DuckDB resources.

# Attach Postgres and write to Postgres from DuckDB

```shell
❯ duckdb susa_lzn2.duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D INSTALL postgres;
D LOAD postgres;

D ATTACH 'dbname=svsa user=postgres password=postgres host=host.docker.internal port=5444' AS pg_db (TYPE postgres);
D select count(*) from initial_load_lzn2_20240325121951.a02t045;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│   3584412    │
└──────────────┘
D CREATE TABLE pg_db.a02t045 AS SELECT * FROM initial_load_lzn2_20240325121951.a02t045;
100% ▕████████████████████████████████████████████████████████████▏
```

See more on Fast Postgres to Postgres Initial Load Example with Connector-X and DuckDB by sspaeti · Pull Request #1354 · dlt-hub/dlt · GitHub in combination with dlt.


Origin:
References:
Created 2022-03-01