Finding all pages containing images in Wikimedia Commons category via API

Question

I'm currently trying to find all the pages where images/media from a particular category are being used on Wikimedia Commons.

Using the API, I can list all the images with no problem, but I'm struggling to make the query also return all the pages where those images are used.

Here is an example category with only two media images: https://commons.wikimedia.org/wiki/Category:Automobiles

Here is the API call I am using:

https://commons.wikimedia.org/w/api.php?action=query&prop=images&format=json&generator=categorymembers&gcmtitle=Category%3AAutomobiles&gcmprop=title&gcmnamespace=6&gcmlimit=200&gcmsort=sortkey

The long-term aim is to find all the pages that the images from our collections appear on, and then get all the tags from those pages about the images. We can then use this to enhance our archive of information about those images and, hopefully, use linked data to find relevant images we may not know about from DBpedia.

I might have to do two queries, first getting the images and then requesting info about each page, but I was hoping to do it all in one call.

score 0 · Answer 1 · answered Apr 27 '15 at 19:10

I don't understand your use case ("our collections"?) so I don't know why you want to use the API directly, but if you want to recurse in categories you're going to do a lot of wheel reinvention.

Most people use the tools made by Magnus Manske, creator of MediaWiki; in this case that's GLAMorous. Example with 3 levels of recursion (finds 186k images, 114k usages): https://tools.wmflabs.org/glamtools/glamorous.php?doit=1&category=Automobiles&use_globalusage=1&depth=3

Results can also be downloaded in XML format, so it's machine-readable.

score 0 · Answer 2 · answered Jun 26 '15 at 11:24

Assuming that you don't need to recurse into subcategories, you can just use a prop=globalusage query with generator=categorymembers, e.g. like this:

https://commons.wikimedia.org/w/api.php?action=query&prop=globalusage&generator=categorymembers&gcmtitle=Category:Images_from_the_German_Federal_Archive&gcmtype=file&gcmlimit=200&continue=

The output, in JSON format, will look something like this:

// ...snip...
"6197351": {
    "pageid": 6197351,
    "ns": 6,
    "title": "File:-Bundesarchiv Bild 183-1987-1225-004, Schwerin, Thronsaal-demo.jpg",
    "globalusage": [
        {
            "title": "Wikipedia:Fotowerkstatt/Archiv/2009/M\u00e4rz",
            "wiki": "de.wikipedia.org",
            "url": "https://de.wikipedia.org/wiki/Wikipedia:Fotowerkstatt/Archiv/2009/M%C3%A4rz"
        }
    ]
},
"6428927": {
    "pageid": 6428927,
    "ns": 6,
    "title": "File:-Fernsehstudio-Journalistengespraech-crop.jpg",
    "globalusage": [
        {
            "title": "Kurt_von_Gleichen-Ru\u00dfwurm",
            "wiki": "de.wikipedia.org",
            "url": "https://de.wikipedia.org/wiki/Kurt_von_Gleichen-Ru%C3%9Fwurm"
        },
        {
            "title": "Wikipedia:Fotowerkstatt/Archiv/2009/April",
            "wiki": "de.wikipedia.org",
            "url": "https://de.wikipedia.org/wiki/Wikipedia:Fotowerkstatt/Archiv/2009/April"
        }
    ]
},
// ...snip...

Note that you will very likely have to deal with query continuations, since there may easily be more results than MediaWiki will return in a single request. See the MediaWiki API documentation on continuing queries for more information on handling those (or just use an MW API client that handles them for you).
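If you would rather handle continuation yourself, here is a minimal Python sketch of the loop, assuming the standard behaviour where each response's "continue" object is merged into the next request until it no longer appears. The function names (fetch_json, collect_global_usage) and the User-Agent string are made up for illustration, and it only looks at files directly in the category, not in subcategories.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the Commons API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent (an assumption); replace it with one describing your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "usage-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_global_usage(category, fetch=fetch_json):
    """Return {file title: [(wiki, page title), ...]} for files directly in a category."""
    base = {
        "action": "query",
        "format": "json",
        "prop": "globalusage",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "file",
        "gcmlimit": "200",
    }
    usage = {}
    cont = {"continue": ""}  # same empty "continue=" as in the URL above
    while cont is not None:
        data = fetch({**base, **cont})
        for page in data.get("query", {}).get("pages", {}).values():
            # The same file can show up in several batches, so merge its usages.
            entries = usage.setdefault(page["title"], [])
            for use in page.get("globalusage", []):
                entries.append((use["wiki"], use["title"]))
        # When there is no "continue" object, the result set is complete.
        cont = data.get("continue")
    return usage
```

Calling collect_global_usage("Category:Images_from_the_German_Federal_Archive") would then walk through every batch; the fetch parameter is only there so the HTTP call can be swapped out, e.g. for testing or for adding rate limiting.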
