Kaggle is the world’s largest Data Science community with powerful tools and resources to help you achieve your data science goals.
# Datasets
See datasets in Find good Data Sets or Sources:
!Find good Data Sets or Sources
# Using DuckDB to Read Datasets with Extension
Gaggle is a DuckDB extension for working with Kaggle datasets. TL;DR: You can directly access and query Kaggle datasets with kaggle: as if they were local tables, abstracting away the download, extraction, and caching process.
See how this works with example. First setup API keys with - Go to kaggle.com/settings API section → "Create New Token" and export it:
1
2
|
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"
|
Then start DuckDB and install the extension:
1
2
|
install gaggle from community;
load gaggle;
|
List boston house prices (this dataset here:
Boston House Prices):
1
2
3
4
5
6
7
8
|
-- List files (no recursive parameter in released version)
SELECT * FROM gaggle_ls('vikrishnan/boston-house-prices');
┌─────────────┬───────┬─────────────────────────────────────────────────────────────────────────────────┐
│ name │ size │ path │
│ varchar │ int64 │ varchar │
├─────────────┼───────┼─────────────────────────────────────────────────────────────────────────────────┤
│ housing.csv │ 0 │ /home/sspaeti/.cache/gaggle/datasets/vikrishnan/boston-house-prices/housing.csv │
└─────────────┴───────┴─────────────────────────────────────────────────────────────────────────────────┘
|
Query dataset directly:
1
2
3
4
5
6
7
8
9
10
11
|
SELECT * FROM 'kaggle:vikrishnan/boston-house-prices/housing.csv' LIMIT 5;
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00 │
│ varchar │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 21.60 │
│ 0.02729 0.00 7.070 0 0.4690 7.1850 61.10 4.9671 2 242.0 17.80 392.83 4.03 34.70 │
│ 0.03237 0.00 2.180 0 0.4580 6.9980 45.80 6.0622 3 222.0 18.70 394.63 2.94 33.40 │
│ 0.06905 0.00 2.180 0 0.4580 7.1470 54.20 6.0622 3 222.0 18.70 396.90 5.33 36.20 │
│ 0.02985 0.00 2.180 0 0.4580 6.4300 58.70 6.0622 3 222.0 18.70 394.12 5.21 28.70 │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
|
Or download first, then get file path:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
SELECT gaggle_download('vikrishnan/boston-house-prices');
┌─────────────────────────────────────────────────────────────────────┐
│ gaggle_download('vikrishnan/boston-house-prices') │
│ varchar │
├─────────────────────────────────────────────────────────────────────┤
│ /home/sspaeti/.cache/gaggle/datasets/vikrishnan/boston-house-prices │
└─────────────────────────────────────────────────────────────────────┘
SELECT gaggle_file_path('vikrishnan/boston-house-prices', 'housing.csv');
┌─────────────────────────────────────────────────────────────────────────────────┐
│ gaggle_file_path('vikrishnan/boston-house-prices', 'housing.csv') │
│ varchar │
├─────────────────────────────────────────────────────────────────────────────────┤
│ /home/sspaeti/.cache/gaggle/datasets/vikrishnan/boston-house-prices/housing.csv │
└─────────────────────────────────────────────────────────────────────────────────┘
|
Checking version:
1
2
3
4
5
6
7
|
SELECT gaggle_version();
┌──────────────────┐
│ gaggle_version() │
│ varchar │
├──────────────────┤
│ 0.1.0-alpha.3 │
└──────────────────┘
|
Origin: Kaggle examples
References: Change Data Capture (CDC) - Real-Estate Project, Updating my Open-Source Real-Estate Data Engineering Project
Created 2025-12-02