
I need to get the first/main image of any given wiki page. I could use a scraping tool to do this, and I have tried curl, but perhaps because of a slow internet connection it takes a long time to scrape even one wiki page. On top of that, I need to display at least 7-8 different wiki images at the same time, depending on the user's query.

So there is no point in using curl for this. I tried the wiki API:

https://en.wikipedia.org/w/api.php?action=query&titles=India&prop=images&imlimit=1

But there are no other parameters I can give to sort this list. Usually the first image this API returns is not the main image you see at the top of the page; sometimes the image has little to do with the page's subject.

I need to display just one image for each wiki title. Thanks in advance.

James Hay
Krishna Deepak
  • Hmm, have you taken a look into the API to see which other ways are possible? There are normally more options than this. – hakre Apr 20 '12 at 14:53
  • Do you really mean any wiki page? Or are you limiting your requirements to wikimedia wikis (as per the tag)? Or are you limiting your requirements to wikipedia (as per the example)? – Quentin Apr 20 '12 at 14:56

4 Answers


To get what is often a very good guess at the "main image", use prop=pageimages, provided by the MediaWiki extension "PageImages":

The PageImages extension collects information about images used on a page.

Its aim is to return the single most appropriate thumbnail associated with an article, attempting to return only meaningful images, e.g. not those from maintenance templates, stubs or flag icons. Currently it uses the first non-meaningless image used in the page.

(Text is cc-by-sa 3.0; list of authors)

Usage

To quote from the MediaWiki API documentation:

Returns information about images on the page, such as thumbnail and
presence of photos.
Parameters:

piprop
    Which information to return:

    thumbnail
        URL and dimensions of image associated with page, if any.
    name
        Image title.

    Values (separate with "|"): thumbnail, name
    Default: thumbnail|name

pithumbsize
    Maximum thumbnail dimension. 
    Default: 50

pilimit
    Properties of how many pages to return. 
    No more than 50 (100 for bots) allowed.
    Default: 1

picontinue
    When more results are available, use this to continue. 

Example

https://en.wikipedia.org/w/api.php?action=query&titles=India&prop=pageimages&pithumbsize=300

Return value:

{
    "query": {
        "pages": {
            "14533": {
                "pageid": 14533,
                "ns": 0,
                "title": "India",
                "thumbnail": {
                    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/Political_map_of_India_EN.svg/256px-Political_map_of_India_EN.svg.png",
                    "width": 256,
                    "height": 300
                },
                "pageimage": "Political_map_of_India_EN.svg"
            }
        }
    }
}
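
A minimal PHP sketch of reading the thumbnail URL out of a decoded prop=pageimages response, assuming the JSON shape shown above; the helper name firstPageImage is made up for illustration:

```php
<?php
// Sketch: extract the thumbnail URL from a decoded prop=pageimages
// response (format=json). firstPageImage is an illustrative helper name.
function firstPageImage(array $response) {
    $pages = isset($response['query']['pages']) ? $response['query']['pages'] : array();
    foreach ($pages as $page) {
        if (isset($page['thumbnail']['source'])) {
            return $page['thumbnail']['source'];
        }
    }
    return null; // no page image available
}

// The sample return value from above, condensed to one string:
$json = '{"query":{"pages":{"14533":{"pageid":14533,"ns":0,"title":"India",'
      . '"thumbnail":{"source":"https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/Political_map_of_India_EN.svg/256px-Political_map_of_India_EN.svg.png",'
      . '"width":256,"height":300},"pageimage":"Political_map_of_India_EN.svg"}}}}';
echo firstPageImage(json_decode($json, true)), "\n";
```

In a real script you would build the request URL as in the example above (with format=json appended) and fetch it with file_get_contents before decoding.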


Rainer Rillke
  • Just wanted to drop a quick note to say thank you - your post just helped me figure something out. – grpcMe Jun 07 '16 at 11:58
api.php?action=query&titles=India&prop=images

This gives you the full list of all images, sorted alphabetically. To get the first image in document order, you can parse the non-API page instead. If you combine both, you'll probably get the most out of it:

$topic = 'India';
$url = sprintf('http://en.wikipedia.org/wiki/%s', urlencode($topic));
// Send a browser-like User-Agent, or Wikipedia may refuse the request.
$options = array(
    'http' => array(
        'user_agent' => 'Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.102011-10-16 20:23:50',
    )
);
$context = stream_context_create($options);
libxml_set_streams_context($context);
// Wikipedia's HTML contains duplicate IDs; collect the resulting libxml
// warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xp = new DOMXPath($doc);
// First thumbnail image in document order.
$result = $xp->query('(//img[@class = "thumbimage"])[1]');
$image = ($result && $result->length) ? $result->item(0) : NULL;
echo $doc->saveXML($image), "\n";
hakre
  • As I said, I don't want to scrape pages, because they are bulky and I have to do as many as 10 pages in an instant. Please read the first 2 lines of my question. – Krishna Deepak Apr 21 '12 at 11:02
  • Bulky? Normally they are quickly accessed and parsing is like a breeze thanks to DOMDocument. Also you can do caching. – hakre Apr 21 '12 at 11:11
  • I'm getting some errors with this script. Also, I got the whole page when I changed the class to 'image'. The error I get every time is: Warning: DOMDocument::loadHTMLFile(): ID CITEREFInternational_Monetary_Fund_2011 already defined in http://en.wikipedia.org/wiki/India, line: 1325 in /var/www/wiki.php on line 18 Warning: DOMDocument::loadHTMLFile(): ID CITEREFKuiper2010 already defined in http://en.wikipedia.org/wiki/India, line: 1395 in /var/www/wiki.php on line 18. Don't mind the line numbers – Krishna Deepak Apr 21 '12 at 12:09
  • You are getting warnings about duplicate IDs; that's because the HTML source is not standards-compliant. Use [`libxml_use_internal_errors`](http://php.net/libxml_use_internal_errors) to control that behavior. – hakre Apr 21 '12 at 13:07

Seems like the images are returned in alphabetical order... weird.

Anyway, this might work better:

https://en.wikipedia.org/w/api.php?action=parse&text={{Barack_Obama}}&prop=images

Unfortunately, only the first image is usable, but at least it's the right one.
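
A PHP sketch of picking that first image name out of the parse response; the response shape ({"parse":{"images":[...]}} with format=json) and the sample filenames below are assumptions, so check them against your wiki's actual output:

```php
<?php
// Sketch: take the first image filename from a decoded action=parse
// response. The response shape and the sample filenames are assumptions.
function firstParsedImage(array $response) {
    $images = isset($response['parse']['images']) ? $response['parse']['images'] : array();
    return isset($images[0]) ? $images[0] : null;
}

// Hypothetical sample response for illustration only:
$sample = '{"parse":{"images":["Official_portrait.jpg","Signature.svg"]}}';
echo firstParsedImage(json_decode($sample, true)), "\n";
```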

Jonny Burger
$wikipage = file_get_contents('http://en.wikipedia.org/wiki/Cats');
// Grab every <img> tag on the page.
preg_match_all('/<img[^<]+?>/', $wikipage, $matches);

Typically the main image will be the second match, after the padlock icon (http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png).
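
A sketch of that second-match heuristic against a tiny made-up HTML snippet (hedged: the padlock only appears on protected pages, so the position of the main image varies):

```php
<?php
// Sketch: with preg_match_all over the page HTML, the second <img>
// match is often the main image, the first frequently being the padlock
// icon on protected pages. This heuristic is fragile; the HTML below is
// a made-up stand-in for a real Wikipedia page.
$html = '<img src="padlock.png"><p>intro</p><img src="main_photo.jpg">';
preg_match_all('/<img[^<]+?>/', $html, $matches);
$main = isset($matches[0][1]) ? $matches[0][1] : null;
echo $main, "\n";
```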

squarephoenix
  • As I said, I don't want to scrape pages, because they are bulky and I have to do as many as 10 pages in an instant. Please read the first 2 lines of my question. – Krishna Deepak Apr 20 '12 at 17:12