Find Good Data Sets or Sources
Good data sets are the starting point for most data engineering projects. The sources listed here can help you find one.
# Data Sets
Sources that offer great datasets:
- Kaggle Datasets: A platform for data science and machine learning competitions that hosts a wide variety of datasets on topics like finance, sports, healthcare, and more.
  - House prices on Kaggle.com
  - Can be used from DuckDB with the Gaggle extension for working with Kaggle datasets.
- Google Dataset Search: Google’s tool for finding datasets across the web, now included in default search as of 2023-03-05. Learn more at the Google AI Blog
- New York Taxi Dataset: The well-known NYC taxi datasets, recently updated with the much better Apache Parquet format (CSV is still available as well) - see also NYC Taxi Dataset
- OpenData: Open-data initiatives that aim to make public data freely available.
  - CH: OpenData Swiss: Mostly relevant if you’re in Switzerland.
  - US: Data.gov: The US government’s open data repository containing datasets from various agencies and organizations, covering topics including agriculture, climate, education, energy, and more.
- SeattleDataGuy: Great resources for data engineering projects, including a video suggesting 5 data sources.
- Data Is Plural: A weekly newsletter of useful and curious datasets with a Markdown archive and RSS/Atom feeds.
  - Find their GitHub
  - Or as Google Sheet
- UCI Machine Learning Repository: A collection of databases, domain theories, and data generators used by the machine learning community for research purposes, containing datasets related to various domains including text, image, and time-series data.
- Awesome Public Datasets: A GitHub repository containing a curated list of high-quality public datasets on various topics such as agriculture, biology, climate, economics, education, and more.
- FiveThirtyEight: A website that focuses on opinion poll analysis, politics, economics, and sports blogging with datasets available.
- CERN Open Data Portal: Open data from CERN’s research and experiments.
- NOAA Weather data: Weather and climate data from the National Oceanic and Atmospheric Administration.
- Hugging Face Datasets: A collection of datasets for machine learning and AI research by Hugging Face.
- Wikipedia Data sets
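
As a starting point for the NYC taxi entry above, here is a minimal sketch of how one might address a monthly Parquet file and preview it with DuckDB. The base URL and file-name pattern are assumptions based on how the files have been published and should be checked against the official NYC TLC trip record page. The helper names `taxi_parquet_url` and `preview` are hypothetical, and `preview` needs the third-party `duckdb` package plus network access.

```python
from datetime import date

# Assumption: base URL and naming pattern of the monthly Parquet files.
# Verify against the official NYC TLC trip record page before relying on it.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def taxi_parquet_url(year: int, month: int, kind: str = "yellow") -> str:
    """Build the (assumed) download URL for one month of trip data."""
    date(year, month, 1)  # raises ValueError for an invalid year/month
    return f"{BASE_URL}/{kind}_tripdata_{year:04d}-{month:02d}.parquet"


def preview(url: str, limit: int = 5):
    """Return the first rows of a remote Parquet file as a list of tuples.

    Requires `pip install duckdb`; reading over HTTPS relies on DuckDB's
    httpfs extension, which recent versions usually load on demand.
    """
    import duckdb  # imported lazily so the URL helper works without it

    query = f"SELECT * FROM read_parquet('{url}') LIMIT {int(limit)}"
    return duckdb.sql(query).fetchall()


# Example usage (needs network access and duckdb installed):
# rows = preview(taxi_parquet_url(2023, 1))
```

Because Parquet is columnar, a query like this only has to read the parts of the file it needs, which is one reason the format is more convenient than CSV for files of this size.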
# Paid Datasets
- PUDL (Public Utility Data Liberation Project): Provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
# Real-Estate
# APIs
- public-apis: A collective list of free APIs.
- spark-joy Mock APIs: A collection of mock APIs and resources.
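
Most of the APIs in lists like these return JSON over HTTP, so a first extraction step can be very small. The following is a minimal sketch using only the Python standard library. The function name `fetch_json` and the example URL in the comment are hypothetical placeholders, not endpoints taken from either list.

```python
import json
import urllib.request


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # json.load reads the whole body and decodes the bytes itself
        return json.load(response)


# Example usage (hypothetical endpoint, replace with a real API URL):
# records = fetch_json("https://example.com/api/items")
```

For a real pipeline you would likely add retries, authentication, and pagination on top of this, depending on what the chosen API requires.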
# Other Lists
- Open Data by David Gasquez: Dataset indexes and comprehensive information about open data in general.
- The Top 17 Places to Find Datasets: An article covering the best places to find datasets.
- ClickHouse Play: A ClickHouse playground with many datasets already ingested and available for exploration.
# Further Reads
- Our World in Data: Research and data to make progress against the world’s largest problems.
- Open-Source Data Engineering Projects
Created 2022-05-17