Find good Data Sets or Sources
Data sets are always useful for data engineering projects. Here are some sources that can help you find one.
# Data Sets
Some sources that offer great datasets:
- Kaggle Datasets: Kaggle is a platform for data science and machine learning competitions, and it hosts a wide variety of datasets. You can find datasets on various topics, such as finance, sports, healthcare, and more. Browse through their dataset collection at https://www.kaggle.com/datasets.
- Google Dataset Search
  - As of 2023-03-05, datasets are included in the default search: "Datasets at your fingertips in Google Search" (Google AI Blog).
- If you need a big one, the New York Taxi dataset is a famous choice. It was recently made available in the much better Apache Parquet format (but CSV is still there as well).
- Less relevant for you, but here in Switzerland we have some open-data initiatives that try to make all public data openly available.
- The Seattle Data Guy also has great resources, e.g. his YouTube video "5 Data Sources for Your Data Engineering Projects - Data Engineering Portfolio Part (1/5)" suggests five of them.
- Weekly newsletter of useful/curious datasets: Data Is Plural
- UCI Machine Learning Repository: The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators used by the machine learning community for research purposes. It contains datasets related to various domains, including text, image, and time-series data. Visit the repository at https://archive.ics.uci.edu/ml/index.php.
- Awesome Public Datasets: This is a GitHub repository that contains a curated list of high-quality public datasets on various topics, such as agriculture, biology, climate, economics, education, and more. You can find the list at https://github.com/awesomedata/awesome-public-datasets.
- FiveThirtyEight: FiveThirtyEight is a website that focuses on opinion poll analysis, politics, economics, and sports blogging.
- CERN Open Data Portal
- public-apis/public-apis: A collective list of free APIs
- sw-yx/spark-joy: README.md on GitHub
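
Several of the sources above, the New York Taxi data among them, publish files as CSV or Parquet. Once you have downloaded a file, a first look at it might be sketched as follows. This is only a minimal illustration: the helper names `load_rows` and `column_names` and the file name `trips.csv` are hypothetical, and the Parquet branch assumes the third-party `pyarrow` package is installed.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Return all rows of a local CSV or Parquet file as a list of dicts."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # The csv module yields every value as a string.
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".parquet":
        # Imported lazily so the CSV branch works without pyarrow installed.
        import pyarrow.parquet as pq

        # Parquet stores a schema, so values keep their types.
        return pq.read_table(path).to_pylist()
    raise ValueError(f"unsupported file type: {suffix}")


def column_names(rows):
    """Column names taken from the first row, or an empty list if no rows."""
    return list(rows[0].keys()) if rows else []


if __name__ == "__main__":
    # Hypothetical file name; point this at whatever you downloaded.
    sample = Path("trips.csv")
    if sample.exists():
        rows = load_rows(sample)
        print(len(rows), "rows with columns", column_names(rows))
```

Both branches load the whole file into memory, which is fine for a quick look but probably not for the largest files; for those, a columnar or chunked reader is likely the better fit.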