You can use the revisions property along with the rvgeneratexml parameter to generate a parse tree for the article. You can then query the tree with XPath, or traverse it manually, to find the information you need.
Here's an example:
$page = 'Radiohead';
$api_call_url = 'https://en.wikipedia.org/w/api.php?action=query&titles=' .
    urlencode( $page ) . '&prop=revisions&rvprop=content&rvgeneratexml=1&format=json';
// You have to identify yourself to the API; see Meta-Wiki for details.
$user_agent = 'Your name <your email>';
$curl = curl_init();
curl_setopt_array( $curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_USERAGENT => $user_agent,
    CURLOPT_URL => $api_call_url,
) );
$response = json_decode( curl_exec( $curl ), true );
curl_close( $curl );
foreach( $response['query']['pages'] as $page_data ) {
    $parsetree = simplexml_load_string( $page_data['revisions'][0]['parsetree'] );
    // Here we use XPath to find the Origin parameter of the Infobox musical artist
    // template and its value. See the XPath specification for the syntax. You could
    // just as well traverse the tree and look for the nodes manually. Feel free to
    // inspect the parse tree to get a better grip on it.
    $infobox_origin = $parsetree->xpath( '//template[contains(string(title),' .
        '"Infobox musical artist")]/part[contains(string(name),"Origin")]/value' );
    // xpath() may return an empty array (or false on error), so check before indexing.
    if ( !empty( $infobox_origin ) ) {
        echo trim( strval( $infobox_origin[0] ) );
    }
}
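If you would rather traverse the tree manually instead of using XPath, something along these lines might work. This is only a rough sketch: the $sample_xml fragment is hand-written for illustration (a real parse tree from the API is much larger and may be shaped differently), find_template_param() is a hypothetical helper and not part of any library, and the type declarations assume PHP 7.1 or later.

```php
<?php
// Hypothetical, hand-written parse tree fragment used only for illustration.
// It is an assumption about the general shape of the tree, not real API output.
$sample_xml = '<root><template><title>Infobox musical artist</title>' .
    '<part><name>Name</name><equals>=</equals><value>Radiohead</value></part>' .
    '<part><name>Origin</name><equals>=</equals>' .
    '<value> Abingdon, Oxfordshire, England </value></part>' .
    '</template></root>';

/**
 * Hypothetical helper: walk the tree depth-first and return the trimmed value
 * of the first parameter named $param inside a template whose title contains
 * $template. Returns null if no such parameter is found.
 */
function find_template_param( SimpleXMLElement $node, string $template, string $param ): ?string {
    foreach ( $node->children() as $child ) {
        if ( $child->getName() === 'template'
            && strpos( (string) $child->title, $template ) !== false ) {
            foreach ( $child->part as $part ) {
                // Exact match on the trimmed parameter name.
                if ( trim( (string) $part->name ) === $param ) {
                    return trim( (string) $part->value );
                }
            }
        }
        // Templates can be nested inside other elements, so keep descending.
        $found = find_template_param( $child, $template, $param );
        if ( $found !== null ) {
            return $found;
        }
    }
    return null;
}

$parsetree = simplexml_load_string( $sample_xml );
$origin = find_template_param( $parsetree, 'Infobox musical artist', 'Origin' );
echo $origin ?? 'not found', "\n";
```

Unlike the XPath version, this compares the parameter name exactly (after trimming) rather than with contains(), so a parameter such as "Origin2" would not match by accident.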