DuckDB
WIP: This is an unfinished Note
Below are my uncleaned DuckDB notes. I'm sharing them as they are hopefully helpful to others. Once I have more time, I will turn them into a proper blog post or article. For now, it's a bit messy.
DuckDB is an in-process SQL OLAP database management system with strong SQL support. Its shell borrows from SQLite's shell implementation, and each database is a single file on disk. Think of it as SQLite for analytical (OLAP) workloads, whereas SQLite targets transactional (OLTP) workloads. It can handle vast amounts of data locally, and it's the smaller, lighter counterpart to Apache Druid and other OLAP technologies.
It’s designed to work as an embedded library, eliminating the network latency you usually get when talking to a database.
Skip working with error-prone Excel or CSV files directly
With DuckDB, we no longer need to handle plain files (CSV, Excel, Parquet) by hand: DuckDB puts a schema, types, and a SQL interface on top of them, and it is super fast.
# Why DuckDB
Abstract: Why DuckDB
DuckDB is an innovative in-process analytical database management system that combines simplicity, portability, and high performance. Designed as an embedded SQL OLAP database, DuckDB addresses the need for efficient data analysis on local machines without the complexity of traditional database setups.
Key features of DuckDB include:
- Simplicity: With no external dependencies and an embedded architecture, DuckDB eliminates the need for separate server processes or complex configurations.
- Portability: DuckDB runs on various operating systems and CPU architectures, from edge devices to powerful servers, and even in web browsers via WebAssembly.
- Performance: Utilizing a columnar-vectorized query execution engine, DuckDB achieves superior speed for analytical workloads compared to row-based systems.
- Feature-rich: Despite its lightweight nature, DuckDB offers advanced SQL support, ACID compliance, and integration with popular data science tools.
- Extensibility: A flexible extension mechanism allows for customization and additional functionality.
- Open-source: Released under the MIT License, DuckDB is freely available and open for community contributions.
DuckDB fills a unique niche in the database ecosystem, offering SQLite-like simplicity for analytical tasks. It’s particularly well-suited for data scientists and analysts working with local datasets, providing a fast, free, and user-friendly alternative to traditional data warehousing solutions for many use cases.
^f86df0
More on Why DuckDB.
# History
DuckDB is primarily the work of Mark Raasveldt and Hannes Mühleisen. It’s made up of a million lines of C++ and runs as a stand-alone binary. Development is very active with the commit count on its GitHub repo doubling nearly every year since it began in 2018. DuckDB uses PostgreSQL’s SQL parser, Google’s RE2 regular expression engine and SQLite’s shell.
# DuckDB Labs vs MotherDuck
Hannes Mühleisen and Mark Raasveldt created DuckDB at the Database Architectures Group of the Centrum Wiskunde & Informatica (CWI) in Amsterdam, The Netherlands, and later founded DuckDB Labs, which focuses on commercial services around DuckDB. MotherDuck is a US-based, VC-funded company that ships distributed DuckDB services.
DuckDB Labs, home of the original creators of DuckDB, announced the partnership with MotherDuck on 2022-11-15; see RW Announcing the MotherDuck Partnership.
MotherDuck is the managed service around DuckDB that lets you scale from a local database to a cloud or hybrid one. It was founded by Jordan Tigani, one of the early engineers behind Google BigQuery. Check also his discussion on the Analytics Engineering Podcast about The Personal Data Warehouse, which includes an interesting part on WebAssembly (WASM): applications compiled from languages such as C can run at near-native speed in the browser. Figma, for example, uses it for workloads that would otherwise never work in a browser.
# Release Calendar
# Installation
On macOS: `brew install duckdb`, otherwise check DuckDB Installation – DuckDB.
- Manage DuckDB versions with the duckdb-version-manager called Duckman: Manage your DuckDB CLI with ease
- E.g. `duckman install 0.10.2`, `duckman default 0.10.2`, `duckman run nightly` or `duckman list remote`
# Use Cases
DuckDB is categorized as an analytical and embedded database.
|  | Transactional | Analytical |
|---|---|---|
| Embedded | SQLite | DuckDB |
| Stand-alone | Oracle, Postgres, MySQL | Cloud DWHs: BigQuery, Snowflake, Redshift; Modern OLAP: Druid, ClickHouse, Pinot |
Table inspired by Oliver Molander in Better Programming
# Categories by DuckDB
The questions are usually:
- Is DuckDB like Snowflake? Not really.
- Is DuckDB like PostgreSQL? No, no, cousins, maybe?
- Is DuckDB like Pandas? It’s complicated.
- Is DuckDB like SQLite? yes, no!
- Is DuckDB like Apache Spark? Interesting.
To better understand DuckDB and how it fits into use cases within large enterprises, let’s define the categories in which it can be used. The table below highlights DuckDB’s versatility.
```mermaid
graph TD
    DuckDB[DuckDB]
    DuckDB --> Interactive[Interactive Data Apps]
    DuckDB --> Extensible[Pipeline Compute Engine]
    DuckDB --> CostEffective[Lightweight SQL Analytics Solution]
    DuckDB --> Secure[Secure Enterprise Data Handler]
    DuckDB --> ZeroCopy[Zero-Copy SQL Connector]
    Interactive --> |Embeddable| I[Browser/App Integration]
    Extensible --> |On-Demand & High-Performance| E[SQL Workflows]
    CostEffective --> |Single-node| C[Compute Engine]
    Secure --> |Enhanced Security| S[Compliant Processing]
    ZeroCopy --> |Virtualized| Z[SQL Query Engine]
    classDef useCase fill:#FFEF03,stroke:#333,stroke-width:2px,color:#000
    class Interactive,Extensible,CostEffective,Secure,ZeroCopy useCase;
```
See more on The Enterprise Case for DuckDB: 5 Key Categories and Why Use It | ssp.sh. ^4d2559
A simpler categorization as of 2024-09-16, by Hannes Mühleisen (image from DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube):
# Multi-categories by Alireza Sadeghi
As Alireza Sadeghi would explain it:
- Embeddable
- columnar OLAP database
- interoperable SQL-based data frames
- federated, zero-copy Query Engine
- single-node compute engine
By DuckDB Beyond the Hype - by Alireza Sadeghi
# Curated Use-Cases
See extensive use cases in DuckDB Use Cases.
# Benchmarks
- Fast Analysis With DuckDB + Pyarrow - 2022
- SQL on Python, part 1: The simplicity of DuckDB: How to use DuckDB to analyze 4.6+ million mentions of climate change on Reddit
- Tweet: Impressively fast, collaborative exploratory data analytics over a 20+ million row data set, hosted in the cloud with Drifting’s Jamsocket + Rill Data + DuckDB - 2022
- Taking DuckDB for a spin | Uwe’s Blog - 2019
# Example Projects
# Hype & Trends
Isn’t the beauty and the hype coming (mostly) from DuckDB itself? It is free, extremely fast, and developer-friendly. I haven’t used the MotherDuck service yet, but DuckDB itself keeps winning me over.
I have used DuckDB here and there, and the speed and efficiency, e.g., as a zero-copy layer on top of my data lake (see Build a poor man’s data lake from scratch https://dagster.io/blog/duckdb-data-lake), is pretty cool. It also surprises me how far you get with a single-file database.
# Constraints / Limitations
- Currently no support for Avro
- A single process can write with multiple threads, but multiple processes cannot write to DuckDB concurrently yet.
- Restricted by the amount of RAM/CPU/disk space :)
# Market Trends
Google Trend:
duckdb - Explore - Google Trends
GitHub Stars - Hockey-stick
GitHub Star History
Niche, in comparison to SQLite:
duckdb, /m/01kgq9 - Explore - Google Trends
# Numbers
From DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube:
# LLMs and Models for DuckDB
- 7B parameter text-to-SQL model made by MotherDuck and Numbers Station. duckdb-nsql, more on Text-to-SQL LLM - DuckDB
- AI in SQL prompt: Introducing the prompt() Function: Use the Power of LLMs with SQL!
# Scale Up / Distribution
The hosted solution is MotherDuck. An alternative might be scaling with Ibis in Python (see YouTube, origin).
# Technical
DuckDB is a columnar-vectorized query execution engine.
# Why is it so fast?
From Why DuckDB
DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) is processed in one operation. This greatly reduces the overhead present in traditional systems such as PostgreSQL, MySQL or SQLite, which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries. See more on Columnar vs. Row-based vs. Vector-based
It ships as an amalgamation build - a single giant C++ file (SQLite is a single giant C file). And it’s also backed up by some strong computer science. It’s by the academic researchers behind MonetDB and includes implementations of a bunch of interesting papers:
- Data Management for Data Science - Towards Embedded Analytics (CIDR 2020)
- DuckDB: an Embeddable Analytical Database (SIGMOD 2019 Demo)
Culmination of Factors
It’s the culmination of lots of research in columnar systems. Including vectorized compute, decoupled storage, file formats, query plans, join optimization, volcano vs push, etc. Tweet / Twitter. See also Differential Storage
Also check out Vectorized Query Engine.
# Comments from Developer Voices with Hannes Mühleisen
- PostgreSQL was not reimplemented, but building on parts of it (e.g., its SQL parser) helped greatly with the speed of DuckDB's development.
- Smart Use of Modern Hardware: Recognizing that hardware tends to improve faster than data size grows, DuckDB is designed to make the most of the processing power available on typical modern computers. This means that for many practical use cases, DuckDB can handle the data processing needs without requiring distributed systems that add complexity and overhead.
# Philosophies and Guiding Principles of DuckDB
Personal philosophies and first principles they tried to solve from the very beginning, which were the guiding principles for Hannes Mühleisen to follow when building DuckDB:
- Simplicity and Usability: One of the core ideas behind DuckDB is to challenge the complexity traditionally associated with setting up and using data management systems. The team aimed to create a database that is straightforward to install and use, removing the barriers often encountered with tasks like configuration, user creation, and database initialization.
- Efficient Data Ingestion and Export: Recognizing that getting data into and out of a database is a fundamental operation that should be optimized, DuckDB focuses on making data import and export operations fast and easy. This focus on data movement efficiency directly addresses one of the initial interactions users have with the system, aiming to leave a positive impression and facilitate smooth data workflows.
- In-Process Model: DuckDB adopts an in-process model similar to SQLite, where the database runs within the same process as the application. This design choice significantly influences the system’s architecture, as it necessitates careful resource management and crash safety measures since the database shares its runtime environment with the host application.
All of this combines academic rigor with practical application. The development of DuckDB is deeply rooted in academic research, leveraging advanced concepts in database design and query optimization. However, unlike some academic projects that remain theoretical, DuckDB has been built with a focus on practical application and ease of use. This ensures that it not only incorporates cutting-edge techniques for data processing but is also accessible and beneficial to a wide range of users.
# Comparing to Pandas
- Row-by-Row Processing: Many Pandas operations (e.g., `apply` with a Python function) effectively run row by row, which can be less efficient than columnar, vectorized processing for certain data manipulations and analyses. This can lead to increased memory usage and slower performance when working with large datasets.
- Materialization: Hannes discussed the concept of materialization, where intermediate results are stored in memory during the execution of a sequence of operations. In Pandas, operations might result in the creation of new data structures (like DataFrames) at each step, which can be resource-intensive and slow down computations.
DuckDB’s approach, by contrast, is to use vectorized query execution, where operations are performed on batches of data (chunks of rows) at a time, and it employs a columnar storage format. This method is more efficient for CPUs to process because it can take better advantage of modern hardware capabilities, such as cache utilization and vectorized instructions. The columnar approach also reduces the overhead associated with row-by-row processing and intermediate data materialization, leading to faster execution times, especially for analytical queries that operate on large volumes of data.
# DuckDB on top of Parquet vs. Native DuckDB
If you load the data into DuckDB's native format (e.g., the NYC Taxi dataset), you can drastically reduce query times further. But DuckDB is already fast with zero-copy queries directly on top of Parquet.
# DuckDB Storage Format compared to Parquet
DuckDB has its own storage format. It looks a bit like Parquet, but there are important differences. A DuckDB table is split into “row groups”, and a row group contains “column segments”. This is common for analytics storage formats; Parquet uses it too, though the terminology differs: Parquet says “column chunks” instead of “column segments”. But there are bigger differences:
1. DuckDB does not use general-purpose compression (e.g. Snappy), which leads to a bigger data size but requires less computation for (de)compression.
2. DuckDB uses a fixed block size, which enables in-place modifications, including updates and deletes. Mutability!
DuckDB does use lightweight, data-specific compression algorithms, for example run-length encoding and dictionary encoding. Different column segments can use different algorithms; each segment is analyzed to determine which one is best, with data type and observed patterns (if any) as relevant factors. Mark Raasveldt (DuckDB creator) wrote a great blog about compression in DuckDB.
Jorrit Sandbrink on LinkedIn: #dataengineering #softwareengineering
And the referenced blog Lightweight Compression in DuckDB – DuckDB.
# Extensions
# DuckDB Delta Lake native integration
- DuckDB can read Delta Lake (2024-06-10) out of the box now with their own Parquet reader with delta-rs and the Delta Kernel. Read more on Native Delta Lake Support in DuckDB – DuckDB
There is also an integration with Unity Catalog, announced at the Databricks Data & AI Summit 2024.
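Per the blog post above, the `delta` extension exposes a `delta_scan` table function. A hedged sketch (the path is a placeholder; the extension is downloaded on `INSTALL`):

```sql
INSTALL delta;
LOAD delta;

-- Query a Delta table directly, no Spark needed
SELECT count(*) FROM delta_scan('s3://my-bucket/my-delta-table');
```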
# DuckDB Community Extensions
See DuckDB Community Extensions.
# Comparison and Alternatives
ClickHouse recently announced, with the acquisition of chDB, an embedded “local” flavor of ClickHouse. Which is interesting, as DuckDB tries to become more like OLAP systems such as ClickHouse for large data, while ClickHouse tries to become more like DuckDB with this local approach.
Comparison to Postgres:
Developers are treating PostgreSQL almost like DuckDB—an integrated query engine with a pluggable storage layer. Except, unlike DuckDB, PostgreSQL’s storage layer is far more mature (DuckDB deliberately doesn’t even document their storage format). Extensions automatically inherit all of the transactional goodness that PostgreSQL provides. Databases Are Falling Apart: Database Disassembly and Its Implications
- DuckDB vs Apache Arrow:
- DuckDB: In-memory analytical database for querying large datasets
- Arrow: Cross-language platform for standardized in-memory data representation
- DuckDB vs ClickHouse:
- DuckDB: Single-node, in-memory database for embedded analytics
- ClickHouse: Distributed, columnar storage database for large-scale analytics
- DuckDB vs SQLite:
- DuckDB: Optimized for analytical queries on large datasets
- SQLite: Lightweight, file-based database for general-purpose use
- DuckDB vs PostgreSQL:
- DuckDB: In-memory, analytical database focused on complex queries
- PostgreSQL: Feature-rich, general-purpose relational database
- DuckDB vs AWS services:
- DuckDB: Embedded, in-memory analytical database
- AWS: Offers various cloud-based services (e.g., Redshift, RDS, Athena) for different data processing needs
# Why DuckDB over Polars
Think about Polars like a faster version of pandas due to rust and efficient usage of all the cores on your machine. However, duckdb is a database with full-blown memory management https://duckdb.org/2024/07/09/memory-management.html where working out of core is ingrained in all the operators. Under heavy memory pressure, this tends to perform better / to have fewer crashes. In particular, having a buffer manager is key here. Georg Heiler on LinkedIn: GitHub - feldera/feldera: The Feldera Incremental Computation Engine by Georg Heiler
# Terminal Tool to query
GitHub - tconbeer/harlequin: The DuckDB IDE for Your Terminal.
# Code Examples
# Python API and Handling Data
# Fetching Data with SQL
# Create Table
# Insert Data
# Upgrade existing Database (export/import)
Upgrade DuckDB Database (Export and Import)
# Read from another duckdb file within DuckDB:
# External Table with dbt-duckdb
External Materializations can save data in Parquet format automatically (when using dbt with jwills/dbt-duckdb), which allows Rill Developer or DuckDB to read those aggregated tables.
See more on MotherDuck or the DuckDB course for beginners - YouTube, or awesome-duckdb: 🦆 A curated list of awesome DuckDB resources.
# Attach Postgres and write to Postgres from DuckDB
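A hedged sketch of what this looks like with the `postgres` extension; it assumes a running Postgres instance, and the connection string and table names are placeholders:

```sql
INSTALL postgres;
LOAD postgres;

-- Attach a running Postgres instance
ATTACH 'dbname=mydb user=postgres host=127.0.0.1' AS pg (TYPE postgres);

-- Read from Postgres inside DuckDB
SELECT count(*) FROM pg.public.orders;

-- Write a DuckDB result back to Postgres
CREATE TABLE pg.public.daily_totals AS
    SELECT order_date, sum(amount) FROM pg.public.orders GROUP BY order_date;
```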
See more on Fast Postgres to Postgres Initial Load Example with Connector-X and DuckDB by sspaeti · Pull Request #1354 · dlt-hub/dlt · GitHub in combination with dlt.
See also pg_duckdb.
# Secret handling
# Listing Secrets
# Creating Secrets
Create one based on ENV vars:
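A sketch assuming the S3 secret type with the `credential_chain` provider, which picks up credentials from environment variables and the rest of the AWS config chain (requires the relevant extension to be loadable):

```sql
-- Credentials are resolved from the environment / AWS credential
-- chain (env vars, config files, IAM roles, ...)
CREATE SECRET env_secret (
    TYPE S3,
    PROVIDER credential_chain
);
```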
# Persistent secret
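A sketch with placeholder credentials; a persistent secret survives restarts:

```sql
-- Stored (unencrypted) under ~/.duckdb so it survives restarts
CREATE PERSISTENT SECRET my_s3 (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'placeholder',
    REGION 'eu-central-1'
);
```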
# Temporary secret with keys
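A sketch with placeholder credentials; without the `PERSISTENT` keyword the secret is temporary, scoped to the current session:

```sql
-- Lives only for this session
CREATE SECRET tmp_s3 (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'placeholder',
    REGION 'eu-central-1'
);
```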
See more on Secrets Manager – DuckDB
# Future of DuckDB
Next year (2025) outlook from DuckDB – Overview and latest developments (DuckCon #5, Seattle, 2024) - YouTube:
- Extension Ecosystem
- extension in different languages
- Lakehouse Data Formats:
- Support on write to Table Formats
- Optimiser Improvements
- Partition/Sorting Awareness
- Cardinality Estimation
- Extensible Parser
# Other resources
- DuckDB Beyond the Hype by Alireza Sadeghi
  - Federated Query Engine or zero-copy
  - Interoperable SQL-Powered Dataframes
  - check also DuckDB
  - Reddit comment for more insights
- Useful Snippets: SQL, Python & More for DuckDB | DuckDB Snippets
Origin:
References:
Created 2022-03-01