
Given any page on Wikipedia, such as the one for Coffee, I'm trying to figure out how to extract a list of all references (including any metadata) on the page. At first glance this seems easy, since most pages list them under a section called References. However, when you examine the wikitext of those pages, you find that the References section is just a pointer to a reference-list template, which I believe generates the list dynamically from all of the entries scattered throughout the page's text.

When I examine the wikitext of the passages each reference is attached to, I find that the citations are enclosed in <ref></ref> tags. The content between these tags depends on the citation type.
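For example, a reference built with the {{cite web}} template looks roughly like this in the wikitext (an illustrative, made-up citation):

    Coffee is one of the most popular drinks.<ref>{{cite web |url=https://example.com/coffee |title=Coffee facts |last=Doe |first=Jane |access-date=2018-08-23}}</ref>

while other references use {{cite book}}, {{cite journal}}, or plain free-form text between the tags.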

So one strategy would be to query the full content of the page and do my own parsing to find all <ref></ref> pairs. However, I suspect there must be a way to do this within the MediaWiki API that I'm not finding. Is there one? I'd rather pull all of this from the wikitext, or something other than the final HTML, as I expect the former to be more stable.
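Here is a rough sketch of that fallback, in case it clarifies what I mean (Python with the requests library; the regex is naive and deliberately ignores self-closing <ref name="..."/> back-references and nested markup):

    import re
    import requests

    # Fetch the raw wikitext of the page through the MediaWiki API
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": "Coffee",
            "prop": "wikitext",
            "format": "json",
        },
    )
    wikitext = resp.json()["parse"]["wikitext"]["*"]

    # Naive scan for <ref ...>...</ref> pairs; reuse tags like
    # <ref name="foo"/> are not captured here.
    refs = re.findall(r"<ref[^>/]*>(.*?)</ref>", wikitext, flags=re.DOTALL)
    print(len(refs), "inline references found")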

mix
  • I don't know why you've been downvoted - I'd also have thought that there'd be a MediaWiki/Wikipedia API for getting reference/citation data. –  Aug 23 '18 at 12:43

2 Answers


I don't know exactly what information you are looking for in the <ref>s, but if you need only the external links, you can use the MediaWiki API with the parse action:

https://en.wikipedia.org/w/api.php?action=parse&page=Coffee&prop=externallinks
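For example, in Python (a minimal sketch; format=json is added so the response is machine-readable):

    import requests

    # Ask the parse API for every external link on the rendered page
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": "Coffee",
            "prop": "externallinks",
            "format": "json",
        },
    )
    for url in resp.json()["parse"]["externallinks"]:
        print(url)

One caveat: this returns every external link on the page, including those in the External links section, not only the URLs that appear inside <ref> tags.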
Termininja

There are tools that can handle the Wikipedia XML dump format:

This post covers some of the tools for handling Wikipedia dumps: http://engineering.idioplatform.com/2016/02/18/wikipedia-toolkit.html

Another possibility (probably even easier) is to use Wikidata:
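A minimal sketch of that route (Python with requests; it resolves the enwiki title to its Wikidata entity and walks the structured references attached to each claim — note these are Wikidata's sourced statements, not the article's footnotes):

    import requests

    # Resolve the English Wikipedia title to its Wikidata entity
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "sites": "enwiki",
            "titles": "Coffee",
            "props": "claims",
            "format": "json",
        },
    )
    entity = next(iter(resp.json()["entities"].values()))

    # Each claim (statement) may carry a list of structured references
    for prop, claims in entity["claims"].items():
        for claim in claims:
            for ref in claim.get("references", []):
                print(prop, list(ref["snaks"].keys()))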

David Przybilla