3

Does MediaWiki provide a way to return the information in the 'Background information' table (usually on the right of the article page)? For example, I would like to grab the Origin from Radiohead:

http://en.wikipedia.org/wiki/Radiohead

Or do I need to parse the HTML page?

srd.pl

3 Answers

4

You can use the revisions property along with the rvgeneratexml parameter to generate a parse tree for the article. Then you can apply XPath or traverse it and look for the desired information.

Here's some example code:

$page = 'Radiohead';
$api_call_url = 'http://en.wikipedia.org/w/api.php?action=query&titles=' .
    urlencode( $page ) . '&prop=revisions&rvprop=content&rvgeneratexml=1&format=json';

You have to identify yourself to the API; see Meta-Wiki for details.

$user_agent = 'Your name <your email>';

$curl = curl_init();
curl_setopt_array( $curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_USERAGENT => $user_agent,
    CURLOPT_URL => $api_call_url,
) );
$response = json_decode( curl_exec( $curl ), true );
curl_close( $curl );

foreach( $response['query']['pages'] as $page ) {
    $parsetree = simplexml_load_string( $page['revisions'][0]['parsetree'] );

Here we use XPath in order to find the Infobox musical artist's parameter Origin and its value. See the XPath specification for the syntax and such. You could as well traverse the tree and look for the nodes manually. Feel free to investigate the parse tree to get a better grip of it.

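    $infobox_origin = $parsetree->xpath( '//template[contains(string(title),' .
        '"Infobox musical artist")]/part[contains(string(name),"Origin")]/value' );

    // xpath() can return an empty array (or false on error), so guard the lookup.
    if ( !empty( $infobox_origin ) ) {
        echo trim( strval( $infobox_origin[0] ) );
    }
}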
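
To see what the XPath expression matches without calling the API, here is a minimal sketch run against a hand-written fragment. The fragment only imitates the shape of the parse tree (template, title, part, name and value elements); its contents and the helper name infobox_param() are assumptions for illustration, not real API output.

```php
<?php
// Hand-written sample imitating the parse tree shape; NOT real API output.
$sample = <<<XML
<root><template><title>Infobox musical artist
</title><part><name> Name </name>=<value> Radiohead
</value></part><part><name> Origin </name>=<value> Abingdon, Oxfordshire, England
</value></part></template></root>
XML;

// Hypothetical helper: return the trimmed text of one template parameter,
// or null when the template or parameter is not found.
// Names containing a double quote would break the XPath string; this sketch ignores that.
function infobox_param( SimpleXMLElement $tree, $template, $param ) {
    $query = sprintf(
        '//template[contains(string(title),"%s")]/part[contains(string(name),"%s")]/value',
        $template,
        $param
    );
    $nodes = $tree->xpath( $query );
    return empty( $nodes ) ? null : trim( strval( $nodes[0] ) );
}

$tree = simplexml_load_string( $sample );
$origin = infobox_param( $tree, 'Infobox musical artist', 'Origin' );
$missing = infobox_param( $tree, 'Infobox musical artist', 'Label' );

echo $origin, "\n";
```

Note that contains() is case-sensitive, so an article that spells the parameter "origin" would need the lowercase name passed in.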
Matěj G.
  • The options you provided look interesting, so thanks to Matej and hippietrail. I think I'd try the XPath approach first, although I would probably need to implement this in Java. Again, thanks to Matej and hippietrail. – srd.pl May 09 '11 at 11:12
  • Oh, I didn't realize you didn't mention any particular language, I'm sorry for that. – Matěj G. May 09 '11 at 13:46
1

MediaWiki as installed on Wikipedia provides no way to get this information (there are extensions such as Semantic MediaWiki that are designed for this sort of thing, but they are not installed on Wikipedia). You can either parse the output HTML or parse the page's wikitext, or in certain cases (e.g. birth/death year) you might be able to look at the page's categories via the API.
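
As a rough illustration of the "parse the page's wikitext" option, here is a minimal sketch that pulls one infobox parameter out of a wikitext string with a regular expression. The sample wikitext is hand-written and the helper name wikitext_param() is hypothetical; the pattern is deliberately naive and ignores multi-line values and nested templates.

```php
<?php
// Hand-written wikitext sample; the real article's infobox will differ.
$wikitext = <<<'WIKI'
{{Infobox musical artist
| name   = Radiohead
| origin = [[Abingdon, Oxfordshire]], England
| genre  = [[Alternative rock]]
}}
'''Radiohead''' are an English rock band.
WIKI;

// Hypothetical helper: match a single "| key = value" line, case-insensitively.
// Returns the value with wiki links reduced to their visible text, or null.
function wikitext_param( $wikitext, $param ) {
    $pattern = '/^[ \t]*\|[ \t]*' . preg_quote( $param, '/' ) . '[ \t]*=[ \t]*(.*)$/mi';
    if ( !preg_match( $pattern, $wikitext, $m ) ) {
        return null;
    }
    // Turn [[target|label]] into "label" and [[target]] into "target".
    $value = preg_replace( '/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/', '$1', $m[1] );
    return trim( $value );
}

$origin = wikitext_param( $wikitext, 'Origin' );
$label = wikitext_param( $wikitext, 'label' );

echo $origin, "\n";
```

For anything beyond a quick lookup, a real wikitext parser or the parse tree is likely to be more robust than a regular expression.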

Anomie
  • that is unfortunate :/ This MediaWiki is getting more and more disappointing :/ But thanks for your answer. – srd.pl May 06 '11 at 11:06
1

It's a steep learning curve but DBpedia does what you want.

The "Background information table" you mention is called an "Infobox" in Wikipedia parlance, and DBpedia allows very powerful queries on them. Unfortunately, because it's powerful, it's not easy to learn, and I've mostly forgotten what I learned about it a year or two ago. I'll paste a query here though if I manage to learn it again (-:

In the meantime, here is DBpedia's idea of an introduction to how to use it.

This previous SO question will help: Getting DBPedia Infobox categories

UPDATE

OK here is the SPARQL query:

PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?org
WHERE {
    <http://dbpedia.org/resource/Radiohead> dbpprop:origin ?org
}

Here is a URL where you can see it working and play with it.

And here is the output on that page: (you can get output in various formats too)

SPARQL results: org "Abingdon, Oxfordshire, England"@en
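
If you want to run the query from code rather than from that page, the following is a minimal sketch of building the request URL and reading a JSON result. It assumes the public endpoint at http://dbpedia.org/sparql accepts "query" and "format" GET parameters; the response string below is hand-written in the standard SPARQL JSON results layout and stands in for what fetching the URL would return.

```php
<?php
$query = <<<'SPARQL'
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?org
WHERE {
    <http://dbpedia.org/resource/Radiohead> dbpprop:origin ?org
}
SPARQL;

// Assumption: the endpoint takes the query and output format as GET parameters.
$url = 'http://dbpedia.org/sparql?' . http_build_query( array(
    'query'  => $query,
    'format' => 'application/sparql-results+json',
) );

// Hand-written stand-in for the endpoint's response; NOT fetched from DBpedia.
$sample_response = '{"head":{"vars":["org"]},"results":{"bindings":[' .
    '{"org":{"type":"literal","xml:lang":"en","value":"Abingdon, Oxfordshire, England"}}]}}';

// Each binding maps a variable name to an object whose "value" holds the result.
$data = json_decode( $sample_response, true );
$origins = array();
foreach ( $data['results']['bindings'] as $binding ) {
    $origins[] = $binding['org']['value'];
}

echo implode( "\n", $origins ), "\n";
```

To use it for real, fetch $url (for example with the cURL calls from the other answer) and pass the body to json_decode() in place of the sample string.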

hippietrail