3

I am trying to use the Wikipedia API to get all links on all pages. Currently I'm using

https://en.wikipedia.org/w/api.php?format=json&action=query&generator=alllinks&prop=links&pllimit=max&plnamespace=0

but this does not seem to start at the first article and end at the last. How can I get this to generate all pages and all their links?
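In code, what I'm doing is roughly this (a simplified sketch using the requests library, not my exact script):

    import requests

    # Roughly the single request described above (simplified).
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "format": "json",
            "action": "query",
            "generator": "alllinks",
            "prop": "links",
            "pllimit": "max",
            "plnamespace": 0,
        },
    )

    for page in resp.json().get("query", {}).get("pages", {}).values():
        for link in page.get("links", []):
            print(page.get("title"), "->", link["title"])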

dangee1705

2 Answers

6

The English Wikipedia has approximately 1.05 billion internal links. Considering the list=alllinks module has a limit of 500 links per request, it's not realistic to get all links from the API.
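Even at 500 links per request, that works out to over two million requests for the links alone, before accounting for paging through the articles themselves. For completeness, walking the API would look roughly like the sketch below, using generator=allpages plus prop=links and following the API's continue tokens; treat it as an illustration of the scale rather than something to run over the whole wiki:

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    session = requests.Session()

    # Base query: enumerate articles (namespace 0) and ask for their outgoing links.
    base = {
        "format": "json",
        "action": "query",
        "generator": "allpages",
        "gapnamespace": 0,
        "gaplimit": "max",
        "prop": "links",
        "pllimit": "max",
        "plnamespace": 0,
    }

    cont = {}
    while True:
        data = session.get(API, params={**base, **cont}).json()
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("links", []):
                print(page["title"], "->", link["title"])
        if "continue" not in data:
            break
        # Carry the latest continuation tokens into the next request.
        cont = data["continue"]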

Instead, you can download Wikipedia's database dumps and use those. Specifically, you want the pagelinks dump, which contains the links themselves, and most likely also the page dump, which you need to map page IDs to page titles.
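Once you have the dumps (for example enwiki-latest-pagelinks.sql.gz and enwiki-latest-page.sql.gz from dumps.wikimedia.org), you can either load them into MySQL/MariaDB or stream the INSERT statements directly. Below is a rough sketch of the streaming approach; the (pl_from, pl_namespace, pl_title) column order is an assumption based on the older pagelinks schema, so check the CREATE TABLE statement at the top of your dump before relying on it:

    import gzip
    import re

    # Matches the start of each value tuple: (pl_from, pl_namespace, 'pl_title', ...
    # This is a simplification and assumes the older column layout.
    ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")

    def iter_pagelinks(path):
        """Yield (source_page_id, target_namespace, target_title) tuples."""
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith("INSERT INTO `pagelinks`"):
                    for pl_from, pl_ns, pl_title in ROW.findall(line):
                        yield int(pl_from), int(pl_ns), pl_title

    # Example usage (the filename is just the conventional dump name):
    # for source_id, ns, title in iter_pagelinks("enwiki-latest-pagelinks.sql.gz"):
    #     ...

The page dump can be parsed the same way to build an ID-to-title mapping for the source pages.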

svick
2

I know this is an old question, but in case anyone else is searching and finds this, I highly recommend looking at Wikicrush to extract the link graph for all of Wikipedia. It produces a relatively compact representation that can be used to very quickly traverse links.

jkraybill