
I am trying to find the main image link (usually the infobox one) for a page from the CirrusSearch Wikipedia dump. I can get it using the Wikipedia API, but querying the API for every Wikipedia page is too much overhead for the Wikipedia servers, so I'd like to get it from an offline dump instead.
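For reference, the API route I mean is the PageImages query; a minimal sketch follows (the exact parameters here are my assumption of how it is usually queried, and the title is just an example):

```python
import requests  # assumed available; any HTTP client works

# Sketch of the per-page API call this question is trying to avoid:
# prop=pageimages (PageImages extension) returns the page's lead/infobox image.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "pageimages",
        "piprop": "original",
        "titles": "Bouygues Telecom",  # example title from this question
        "format": "json",
    },
)
for page in resp.json()["query"]["pages"].values():
    print(page.get("original", {}).get("source"))
```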

There is an interesting Stack Overflow post describing how to generate the link from the image name (MD5 of the name) and append the result to the domain https://upload.wikimedia.org/wikipedia/commons/. Unfortunately, it does not work for all images: for Bouygues Telecom, for example, the image path is not under wikipedia/commons but under wikipedia/fr.
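A minimal sketch of that MD5-based construction (this only works when the file is actually hosted on Commons, which is the problem described above):

```python
import hashlib

def commons_image_url(filename):
    """Guess the upload.wikimedia.org URL of a Commons-hosted file.

    The two path components are the first hex character and the first two
    hex characters of the MD5 hash of the filename (spaces -> underscores).
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return ("https://upload.wikimedia.org/wikipedia/commons/"
            f"{digest[0]}/{digest[:2]}/{name}")

print(commons_image_url("30C3 Commons Machinery 2.jpg"))
```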

I also tried to get it from http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-imagelinks.sql.gz, but it does not help.

Do you know of any way to get the main image link of a page from a dump?

  • I don't think you can. If you don't want to use the API (the servers will be fine - see e.g. [this thread](https://www.mail-archive.com/search?l=multimedia%40lists.wikimedia.org&q=subject:%22%5C%5BMultimedia%5C%5D+%5C%5BCommons%5C-l%5C%5D+Hashing+Wikimedia+Commons%22&o=newest&f=1) for past discussion - but it would take a while), your best bet is probably parsing the HTML dumps. – Tgr Mar 23 '16 at 22:02
  • Actually, even when parsing the HTML dumps, we can find images (searching by extension), but I can't generate the image link reliably (sometimes it is under wikipedia/commons, sometimes under wikipedia/{lang}). – Florent Valdelievre Mar 24 '16 at 08:37
  • If you can find the image link in the HTML dump, why would you need to generate anything? Maybe you could clarify in the question what you are after. – Tgr Mar 24 '16 at 11:23
  • We can find image names like '30C3_Commons_Machinery_2.jpg', but no path or domain, so we don't know where they are hosted. – Florent Valdelievre Mar 24 '16 at 13:49
  • You are probably looking at something other than the HTML dumps then. Try http://dumps.wikimedia.org/other/kiwix/zim/wikipedia/. – Tgr Mar 24 '16 at 17:29
  • You need to check the existence of the image on the local wiki, and then, if the server returns 404, against Commons (which for historical reasons still has the 'wikipedia/' prefix; I think it used to be commons.wikipedia.org before the non-Wikipedia projects came along or something). You can probably use HEAD requests for this (sketched below), so there is no need to download each image. Alternatively, I think the other dump files will contain indications of the existence of the images. Note that the two URL parts after the site are the first character and the first two characters of the MD5 hash of the filename. – Krenair Nov 06 '16 at 07:23
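A minimal sketch of the fallback check Krenair describes, assuming Python with the requests library (the example filename is just taken from the comments above):

```python
import hashlib
import requests  # assumed available; any client that can send HEAD requests works

def find_image_url(filename, lang="en"):
    """Try the local wiki's upload path first, then fall back to Commons.

    HEAD requests are used so no image bytes are downloaded.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    path = f"{digest[0]}/{digest[:2]}/{name}"
    for bucket in (f"wikipedia/{lang}", "wikipedia/commons"):
        url = f"https://upload.wikimedia.org/{bucket}/{path}"
        if requests.head(url).status_code == 200:
            return url
    return None

# Example filename mentioned in the comments above.
print(find_image_url("30C3 Commons Machinery 2.jpg"))
```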

0 Answers