
I have seen that there are various APIs and tools for viewing the most visited pages of Wikimedia projects such as Wikipedia, but all of these services share a limit: they cannot show more than 1000 pages. I would like the list of the 5000-10000 (or more) most visited pages, in order of traffic.

These are all the services I checked, and in all of them I ran into this limit:

https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bmostviewed

https://stats.wikimedia.org/#/en.wikipedia.org/reading/top-viewed-articles/normal|table|last-month|~total|monthly

https://pageviews.toolforge.org/topviews/?project=en.wikipedia.org&platform=all-access&date=last-month&excludes=

https://wikimedia.org/api/rest_v1/#/Pageviews%20data
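To illustrate the limit, here is a minimal sketch (using only the Python standard library) of querying the REST API's top-articles endpoint listed above; the month in the URL is just an example date, and the endpoint never returns more than 1000 entries:

```python
import json
import urllib.request

# Most-viewed articles on English Wikipedia for one example month
# (June 2020), via the Wikimedia REST API "top" endpoint.
URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
       "en.wikipedia/all-access/2020/06/all-days")

# The API asks clients to send a descriptive User-Agent.
req = urllib.request.Request(URL, headers={"User-Agent": "pageviews-example/1.0"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

articles = data["items"][0]["articles"]
print(len(articles))  # capped at 1000, which is the limit in question
for entry in articles[:5]:
    print(entry["rank"], entry["article"], entry["views"])
```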

I have also found services like https://quarry.wmflabs.org/ and https://query.wikidata.org/ where you can run a query. Technically it might be possible through one of these, but I don't know which query to run to list the pages with the most visits.

I also found an interesting article here: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/ which explains that it is possible to use Google's BigQuery. However, that is an external service, and before using it I wanted to know whether a simpler method exists.


1 Answer


If the REST API doesn't suit your purpose, you'd need to parse the raw data yourself. That's because all the tools you've linked just consume the REST API.

The raw data are available at https://dumps.wikimedia.org/other/pageviews/. There are two groups of files there: those starting with pageviews-, which list the number of views of individual pages, and those starting with projectviews-, which list the number of views of individual projects.

For your goal, you need the pageviews- files. Download the files covering your timespan, then analyze them with a script.

The files are space-separated. Each row represents one page that was visited during that hour. The first column is the project (en is English Wikipedia, for instance), the second is the page title (spaces are represented by underscores), and the third is the total number of pageviews for that hour.
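A minimal sketch of such a script, assuming the column layout described above and that the hourly files have already been downloaded (and decompressed) locally:

```python
from collections import Counter

def top_pages(files, project="en", n=10000):
    """Sum hourly pageview counts across dump files and return the
    n most viewed page titles for one project.

    Each line is expected to look like "en Main_Page 12345 ...":
    project code, page title, view count.
    """
    counts = Counter()
    for path in files:
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) < 3 or parts[0] != project:
                    continue
                try:
                    counts[parts[1]] += int(parts[2])
                except ValueError:
                    continue  # skip malformed rows
    return counts.most_common(n)
```

Called with every hourly file for a month, this yields a ranking of any length, not just the top 1000.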

The technical documentation is available at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews.

Martin Urbanec
  • The dump files are a good option, but they require a lot of resources and time to manage. I was looking for something easily manageable and constantly updated, like the APIs or services mentioned. I found this guide https://cran.r-project.org/web/packages/pageviews/vignettes/Accessing_Wikimedia_pageviews.html which explains which query to use to extract the most visited pages; I tried that query on https://query.wikidata.org/ but it doesn't work – Overflow992 Jul 04 '20 at 08:53
  • query.wikidata.org is for querying Wikidata, a multilingual factual database. Unless I'm missing something, pageviews aren't stored in Wikidata, so I'm afraid the dump files are your only option :/. The link you posted seems to be about consuming the REST API, but since that's limited to the top 1000, it's not going to help :/. – Martin Urbanec Jul 04 '20 at 16:59