
How do I parse Wikipedia XML with PHP? I tried it with SimplePie, but I got nothing. Here is a link whose data I want to retrieve:

http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml

Edit: here is my code:

<?php
    define("EMAIL_ADDRESS", "youlichika@hotmail.com"); 
    $ch = curl_init(); 
    $cv = curl_version(); 
    $user_agent = "curl ${cv['version']} (${cv['host']}) libcurl/${cv['version']} ${cv['ssl_version']} zlib/${cv['libz_version']} <" . EMAIL_ADDRESS . ">"; 
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent); 
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt"); 
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt"); 
    curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity"); 
    curl_setopt($ch, CURLOPT_HEADER, FALSE); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); 
    curl_setopt($ch, CURLOPT_HTTPGET, TRUE); 
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml"); 
    $xml = curl_exec($ch); 
    $xml_reader = new XMLReader(); 
    $xml_reader->xml($xml, "UTF-8"); 
    echo $xml->api->query->pages->page->rev;
?>
Damjan Pavlica
yuli chika

3 Answers


I generally use a combination of CURL and XMLReader to parse XML generated by the MediaWiki API.

Note that you must include your e-mail address in the User-Agent header, or else the API script will respond with HTTP 403 Forbidden.

Here is how I initialize the CURL handle:

define("EMAIL_ADDRESS", "my@email.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl ${cv['version']} (${cv['host']}) libcurl/${cv['version']} ${cv['ssl_version']} zlib/${cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

You can then use this code which grabs the XML and constructs a new XMLReader object in $xml_reader:

curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");

EDIT: Here is a working example:

<?php
define("EMAIL_ADDRESS", "youlichika@hotmail.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl ${cv['version']} (${cv['host']}) libcurl/${cv['version']} ${cv['ssl_version']} zlib/${cv['libz_version']} <" . EMAIL_ADDRESS . ">"; 
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml"); 
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");

// Read forward from the current <page> element and return the wiki markup
// text of its first <rev> child.
function extract_first_rev(XMLReader $xml_reader)
{
    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "rev") {
                $content = htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                return $content;
            }
        } else if ($xml_reader->nodeType == XMLReader::END_ELEMENT) {
            if ($xml_reader->name == "page") {
                throw new Exception("Unexpectedly found `</page>`");
            }
        }
    }

    throw new Exception("Reached the end of the XML document without finding revision content");
}

// Map each page title to the wiki markup of its latest revision.
$latest_rev = array();
while ($xml_reader->read()) {
    if ($xml_reader->nodeType == XMLReader::ELEMENT) {
        if ($xml_reader->name == "page") {
            $latest_rev[$xml_reader->getAttribute("title")] = extract_first_rev($xml_reader);
        }
    }
}

// Convert wiki markup to HTML using the API's `parse` action.
function parse($rev)
{
    global $ch;

    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=parse&text=" . rawurlencode($rev) . "&prop=text&format=xml");
    sleep(3); // pause between API requests to avoid hammering the server
    $xml = curl_exec($ch);
    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");

    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "text") {
                $html = htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                return $html;
            }
        }
    }

    throw new Exception("Failed to parse");
}

foreach ($latest_rev as $title => $rev) {
    echo parse($rev) . "\n";
}
Daniel Trebbien
  • Thanks, @Daniel Trebbien. Is it OK if I use `curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; he; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8");` as the `$user_agent`? – yuli chika Jan 29 '11 at 23:18
  • Also, the data comes back like `{{Wiktionary|re|re-}} '''RE''' may mean: * RE ([[:nl:EDP-Auditing|Register EDP auditor]]), Electronic Data Processing auditor, also IT auditor * [[RE (complexity)]], the set of recursively enumerable languages *... `, which is not HTML format... – yuli chika Jan 29 '11 at 23:19
  • @yuli: That might work, but the system administrators do not like that. You need to include your e-mail address. See: https://secure.wikimedia.org/wikipedia/meta/wiki/User-Agent_policy – Daniel Trebbien Jan 29 '11 at 23:20
  • @yuli: When you query revisions, the contents are in MediaWiki markup form. If you need to convert MediaWiki content to HTML, use the `parse` action of the API. For example: [https://secure.wikimedia.org/wikipedia/en/w/api.php?action=parse&text={{Project:Sandbox}}&format=xml](https://secure.wikimedia.org/wikipedia/en/w/api.php?action=parse&text={{Project:Sandbox}}&format=xml) – Daniel Trebbien Jan 29 '11 at 23:29
  • @Daniel Trebbien, OK, I have switched back from the web-browser string to my e-mail address, but this time nothing is returned. I have pasted my code into my post. Also, the returned data is not in HTML format; do you have any idea how to convert it? Thanks again. – yuli chika Jan 29 '11 at 23:31
  • @Daniel Trebbien, I tried adding `action=parse`, but it returns `Unrecognized value for parameter 'prop': revisions` – yuli chika Jan 29 '11 at 23:39

You could use simplexml:

$xml = simplexml_load_file($url);

See example here: http://php.net/manual/en/simplexml.examples-basic.php

Or DOM:

$xml = new DOMDocument;
$xml->load($url);

Or XMLReader, for huge XML documents that you don't want to read entirely into memory.
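
For the MediaWiki API specifically, the request will be refused with 403 Forbidden unless the User-Agent carries contact information (see the accepted answer), so a minimal SimpleXML sketch might look like the following. The user-agent string and e-mail address are placeholders, and the element names are taken from the XML returned by the query in the question:

// Identify yourself to the API; simplexml_load_file() uses PHP's HTTP stream
// wrapper, which picks up the user_agent ini setting.
ini_set("user_agent", "MyWikiParser/1.0 (me@example.com)");

$url = "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml";
$xml = simplexml_load_file($url);
if ($xml === FALSE) {
    die("Failed to load or parse the API response");
}

// Each <page> element carries a title attribute; the wiki markup of the
// revision is the text content of its <rev> element.
foreach ($xml->query->pages->page as $page) {
    echo $page["title"] . ":\n";
    echo (string) $page->revisions->rev . "\n";
}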

Arnaud Le Blanc
  • When using the MediaWiki API, you can't just call `simplexml_load_file` to retrieve the XML because the response will be HTTP 403 Forbidden. The API script blocks requests that do not include contact information in the `User-Agent` header. – Daniel Trebbien Jan 29 '11 at 23:02
  • @user576875, @LadaRaider, `Warning: DOMDocument::load(http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml) [domdocument.load]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden`. How do I add a `$user_agent` like the one @Daniel Trebbien gave? – yuli chika Jan 29 '11 at 23:06
  • @yuli: Add `ini_set("user_agent", EMAIL_ADDRESS);` before you call `simplexml_load_file`. – Daniel Trebbien Jan 29 '11 at 23:11

You should look at the PHP XMLReader class.
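
For example, here is a minimal sketch along those lines, reusing the API URL from the question; the user-agent string is a placeholder, and the response is fetched with file_get_contents() first so that contact information can be sent in the request headers:

$url = "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml";

// Identify yourself; the API rejects requests without contact information.
$context = stream_context_create(array("http" => array("user_agent" => "MyWikiParser/1.0 (me@example.com)")));
$response = file_get_contents($url, FALSE, $context);

// Stream through the document and print each page title.
$reader = new XMLReader();
$reader->xml($response, "UTF-8");
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == "page") {
        echo $reader->getAttribute("title") . "\n";
    }
}
$reader->close();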

Dalmas