
Given any page on Wikipedia, such as the one for Coffee, I'm trying to figure out how to extract a list of all references (including any metadata) on the page. At first glance this seems easy, since most pages list them under a section called References. However, when you examine the wikitext of those pages, you find that the References section is just a pointer to a reference-list template, which I believe generates the list dynamically from all of the entries scattered throughout the page's text.

When I examine the wikitext of the passages each reference is attached to, I find that the citations are enclosed in <ref></ref> tags. The content between these tags depends on the citation type.
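For example, a reference built with the {{cite web}} template looks roughly like this in the wikitext (an illustrative, made-up citation):

    Coffee is one of the most popular drinks.<ref>{{cite web |url=https://example.com/coffee |title=Coffee facts |last=Doe |first=Jane |access-date=2018-08-23}}</ref>

while other references use {{cite book}}, {{cite journal}}, or plain free-form text between the tags.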

So one strategy would be to query the full content of the page and do my own parsing to find all <ref></ref> pairs. However, I suspect there must be a way to do this within the MediaWiki API that I'm not finding. Is there one? I'd rather pull all of this from the wikitext, or something other than the final HTML, as I expect the former to be more stable.
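Here is a rough sketch of that fallback, in case it clarifies what I mean (Python with the requests library; the regex is naive and deliberately ignores self-closing <ref name="..."/> back-references and nested markup):

    import re
    import requests

    # Fetch the raw wikitext of the page through the MediaWiki API
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": "Coffee",
            "prop": "wikitext",
            "format": "json",
        },
    )
    wikitext = resp.json()["parse"]["wikitext"]["*"]

    # Naive scan for <ref ...>...</ref> pairs; reuse tags like
    # <ref name="foo"/> are not captured here.
    refs = re.findall(r"<ref[^>/]*>(.*?)</ref>", wikitext, flags=re.DOTALL)
    print(len(refs), "inline references found")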

mix
  • I don't know why you've been downvoted - I'd also have thought that there'd be a MediaWiki/Wikipedia API for getting reference/citation data. –  Aug 23 '18 at 12:43

2 Answers


I don't know exactly what information you are looking for in the <ref>s, but if you need only the external links, you can use the MediaWiki API with the parse action:

https://en.wikipedia.org/w/api.php?action=parse&page=Coffee&prop=externallinks
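For example, in Python (a minimal sketch; format=json is added so the response is machine-readable):

    import requests

    # Ask the parse API for every external link on the rendered page
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": "Coffee",
            "prop": "externallinks",
            "format": "json",
        },
    )
    for url in resp.json()["parse"]["externallinks"]:
        print(url)

One caveat: this returns every external link on the page, including those in the External links section, not only the URLs that appear inside <ref> tags.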
Termininja

There are tools that can handle the Wikipedia XML dump format:

This post covers some of the tools for handling Wikipedia dumps: http://engineering.idioplatform.com/2016/02/18/wikipedia-toolkit.html

Another possibility (probably even easier) is to use Wikidata:
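A minimal sketch of that route (Python with requests; it resolves the enwiki title to its Wikidata entity and walks the structured references attached to each claim — note these are Wikidata's sourced statements, not the article's footnotes):

    import requests

    # Resolve the English Wikipedia title to its Wikidata entity
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "sites": "enwiki",
            "titles": "Coffee",
            "props": "claims",
            "format": "json",
        },
    )
    entity = next(iter(resp.json()["entities"].values()))

    # Each claim (statement) may carry a list of structured references
    for prop, claims in entity["claims"].items():
        for claim in claims:
            for ref in claim.get("references", []):
                print(prop, list(ref["snaks"].keys()))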

David Przybilla