# Wikipedia Datasets
From Wikimedia Downloads (Analytics → Pageviews): statistics compiled using the current Pageview Definition. Available as:
- Pageview complete: Our best effort to provide a comprehensive timeseries of per-article pageview data for Wikimedia projects. Data spans from December 2007 to the present with a uniform format and compression.
- Pageview/projectview data filtered to what we believe is only human traffic. Available since May 2015.
- Pageview/projectview data, highly compressed and corrected for outages. This dataset was historically computed using the best source available at the time:
  - 2007 - Dec 2015: compressed and corrected from the pagecounts-raw dataset
    - 2007 to 2011: from pagecounts-raw (to be loaded during the second half of October 2020)
    - 2011 to 2015: from pagecounts-ez
  - Dec 2015 - 2022: compressed and corrected from the pageviews dataset: https://dumps.wikimedia.your.org/other/pageviews/
  - 2015 - 2025: Index of /other/pageviews/ (https://dumps.wikimedia.org/other/pageviews/)
# Others
- Wikipedia Dumps: format: XML and SQL dumps
- Link: https://dumps.wikimedia.org/ (the dumps are .bz2/.gz archives, not ZIP)
- Content: Full page content, revision history, user data, and more
- Updated: Typically twice monthly
- Most commonly used files:
- pages-articles.xml.bz2 - Current versions of article content (most popular for analysis)
- pages-meta-history.xml.bz2 - Full revision history (very large)
- page.sql.gz - Page metadata tables
- categorylinks.sql.gz - Category relationships
- Pre-processed Datasets
- Wikipedia Corpus on Hugging Face - Cleaned text datasets ready for ML/NLP
- DBpedia (https://www.dbpedia.org/) - Structured data extracted from Wikipedia infoboxes
- Wikidata (https://www.wikidata.org/) - Structured knowledge base with JSON dumps
- API: Load MediaWiki data in Python using dltHub
# Downloading
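No commands were captured in this section. As a minimal sketch of the idea in Python, assuming the `pageviews-YYYYMMDD-HH0000.gz` file naming used on the dumps index (the `download` helper is illustrative):

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE = "https://dumps.wikimedia.org/other/pageviews"

def pageviews_url(year: int, month: int, day: int, hour: int) -> str:
    """Build the URL of one hourly pageviews dump file."""
    name = f"pageviews-{year:04d}{month:02d}{day:02d}-{hour:02d}0000.gz"
    return f"{BASE}/{year}/{year}-{month:02d}/{name}"

def download(year, month, day, hour, dest_dir="."):
    """Fetch one hourly file, skipping it if it is already on disk."""
    url = pageviews_url(year, month, day, hour)
    dest = Path(dest_dir) / url.rsplit("/", 1)[1]
    if not dest.exists():
        urlretrieve(url, dest)
    return dest

print(pageviews_url(2024, 1, 1, 0))
```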
# Querying with DuckDB
# ClickHouse way
From WikiStat | ClickHouse Docs:
The dataset contains 0.5 trillion records.
- See the video from FOSDEM 2023: https://www.youtube.com/watch?v=JlcI2Vfz_uk
- And the presentation: https://presentations.clickhouse.com/fosdem2023/
- Data source: https://dumps.wikimedia.org/other/pageviews/
Getting the list of links:
TODO: shell loop over YEAR-MONTH to build the list of URLs.
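The shell loop is not in the note; the same enumeration in Python, following the `/{year}/{year}-{month}/` directory layout of the dumps index (the scraping regex for file names is an assumption):

```python
import re
from urllib.request import urlopen

BASE = "https://dumps.wikimedia.org/other/pageviews"

def month_urls(first_year=2015, last_year=2025):
    """Enumerate the per-month directory URLs holding the hourly files.

    Note: pageviews data only starts in May 2015, so early 2015
    directories may not exist."""
    urls = []
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            urls.append(f"{BASE}/{year}/{year}-{month:02d}/")
    return urls

def list_files(month_url):
    """Scrape one 'Index of' page for the hourly file names."""
    html = urlopen(month_url).read().decode()
    return sorted(set(re.findall(r"pageviews-\d{8}-\d{6}\.gz", html)))

print(month_urls(2015, 2016)[0])
```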
Downloading the data:
(it will take about 3 days)
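The download commands were not captured either. A hedged sketch of a bulk fetch in Python; the three-worker limit and skip-existing behavior are choices made here (to be polite to the mirror and to allow restarts), not the docs' exact commands:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

def dest_path(url, out_dir="pageviews"):
    """Local path for a dump URL (keeps the original file name)."""
    return Path(out_dir) / url.rsplit("/", 1)[1]

def fetch(url):
    """Download one file, skipping it if already present (restart-friendly)."""
    dest = dest_path(url)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(url, dest)
    return dest

def fetch_all(urls, workers=3):
    # A few parallel streams shorten the multi-day wall-clock time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```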
Creating a table:
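The DDL was not captured. A sketch close to what the ClickHouse WikiStat page creates; the exact codecs and ordering key may differ from the docs:

```sql
CREATE TABLE wikistat
(
    time       DateTime CODEC(Delta, ZSTD(3)),
    project    LowCardinality(String),
    subproject LowCardinality(String),
    path       String CODEC(ZSTD(3)),
    hits       UInt64 CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (path, time);
```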
Loading the data:
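The load commands were not captured; the docs stream every file through a parsing pipeline. As a self-contained alternative sketch, one hour can be inserted straight from the dumps server with the `url()` table function (assumes a `wikistat` table with `time`/`project`/`path`/`hits` columns; in practice the timestamp must be derived from each file name):

```sql
-- Load one hour of data (for illustration); repeat per file.
INSERT INTO wikistat (time, project, path, hits)
SELECT
    toDateTime('2024-01-01 00:00:00')         AS time,  -- hour encoded in the file name
    splitByChar(' ', line)[1]                 AS project,
    splitByChar(' ', line)[2]                 AS path,
    toUInt64OrZero(splitByChar(' ', line)[3]) AS hits
FROM url('https://dumps.wikimedia.org/other/pageviews/2024/2024-01/pageviews-20240101-000000.gz',
         'LineAsString', 'line String');
```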
Or loading the cleaned data:
Origin: Find good Data Sets or Sources
References:
Created 2025-11-06