Search

Search IconIcon to open search

Kaggle

Last updatedUpdated: by Simon Späti · CreatedCreated: · 2 min read

Kaggle is the world’s largest Data Science community with powerful tools and resources to help you achieve your data science goals.

# Datasets

See datasets in Find good Data Sets or Sources:

!Find good Data Sets or Sources

# Using DuckDB to Read Datasets with Extension

Gaggle is a DuckDB extension for working with Kaggle datasets. TL;DR: You can directly access and query Kaggle datasets with kaggle: as if they were local tables, abstracting away the download, extraction, and caching process.

See how this works with example. First setup API keys with - Go to kaggle.com/settings API section → "Create New Token" and export it:

1
2
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"

Then start DuckDB and install the extension:

1
2
install gaggle from community;
load gaggle;

List boston house prices (this dataset here: Boston House Prices):

1
2
3
4
5
6
7
8
-- List files (no recursive parameter in released version)
SELECT * FROM gaggle_ls('vikrishnan/boston-house-prices');
┌─────────────┬───────┬─────────────────────────────────────────────────────────────────────────────────┐
    name      size                                        path                                       
   varchar    int64                                      varchar                                     
├─────────────┼───────┼─────────────────────────────────────────────────────────────────────────────────┤
 housing.csv    0    /home/sspaeti/.cache/gaggle/datasets/vikrishnan/boston-house-prices/housing.csv 
└─────────────┴───────┴─────────────────────────────────────────────────────────────────────────────────┘

Query dataset directly:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT * FROM 'kaggle:vikrishnan/boston-house-prices/housing.csv' LIMIT 5;
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00  
                                             varchar                                              
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
  0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60 
  0.02729   0.00   7.070  0  0.4690  7.1850  61.10  4.9671   2  242.0  17.80 392.83   4.03  34.70 
  0.03237   0.00   2.180  0  0.4580  6.9980  45.80  6.0622   3  222.0  18.70 394.63   2.94  33.40 
  0.06905   0.00   2.180  0  0.4580  7.1470  54.20  6.0622   3  222.0  18.70 396.90   5.33  36.20 
  0.02985   0.00   2.180  0  0.4580  6.4300  58.70  6.0622   3  222.0  18.70 394.12   5.21  28.70 
└──────────────────────────────────────────────────────────────────────────────────────────────────┘

Or download first, then get file path:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
SELECT gaggle_download('vikrishnan/boston-house-prices');
┌─────────────────────────────────────────────────────────────────────┐
          gaggle_download('vikrishnan/boston-house-prices')          
                               varchar                               
├─────────────────────────────────────────────────────────────────────┤
 /home/sspaeti/.cache/gaggle/datasets/vikrishnan/boston-house-prices 

└─────────────────────────────────────────────────────────────────────┘

SELECT gaggle_file_path('vikrishnan/boston-house-prices', 'housing.csv');
┌─────────────────────────────────────────────────────────────────────────────────┐
        gaggle_file_path('vikrishnan/boston-house-prices', 'housing.csv')        
                                     varchar                                     
├─────────────────────────────────────────────────────────────────────────────────┤
 /home/sspaeti/.cache/gaggle/datasets/vikrishnan/boston-house-prices/housing.csv 
└─────────────────────────────────────────────────────────────────────────────────┘

Checking version:

1
2
3
4
5
6
7
SELECT gaggle_version();
┌──────────────────┐
 gaggle_version() 
     varchar      
├──────────────────┤
 0.1.0-alpha.3    
└──────────────────┘

Origin: Kaggle examples
References: Change Data Capture (CDC) - Real-Estate Project, Updating my Open-Source Real-Estate Data Engineering Project