Reach a string behind unknown value in JSON

Question

I use Wikipedia's API to get information about a page. The API gives me JSON like this:

"query":{
  "pages":{
     "188791":{
        "pageid":188791,
        "ns":0,
        "title":"Vanit\u00e9",
        "langlinks":[
           {
              "lang":"bg",
              "*":"Vanitas"
           },
           {
              "lang":"ca",
              "*":"Vanitas"
           },
           ETC.
        ]
     }
  }
}

You can see the full JSON response.

I want to obtain all entries like:

{
   "lang":"ca",
   "*":"Vanitas"
}

but the number key ("188791") in the pages object is the problem.

I found "Find a value within nested json dictionary in python", which explains how to enumerate the values.

Unfortunately I get the following exception:

TypeError: 'dict_values' object does not support indexing

My code is:

json["query"]["pages"].values()[0]["langlinks"]

It's probably a dumb question but I can't find a way to pass in the page id value.

Why pick only the first value? There is no ordering in a dictionary, so 'first' depends on various factors out of your control. What should happen if there is more than one entry in that dictionary? — Martijn Pieters, Nov 15 '13 at 21:29
Are you sure there will only be one page, or that you only want the first page (and remember, in both JSON and Python dictionaries, "first" is effectively random) if there are more than one? — abarnert, Nov 15 '13 at 21:29
@abarnert: If he queried only one page, he _should_ get data for only one page. Anything else would be a bug in MediaWiki. — Ilmari Karonen, Nov 15 '13 at 21:31
As a side note, calling your dictionary `json` is a very bad idea. That's the name of the module that you use to encode and decode JSON; once you've hidden it with a dictionary, you can't access the module anymore. — abarnert, Nov 15 '13 at 21:33
@IlmariKaronen: but he doesn't; take a look at the API link, it is a title search. — Martijn Pieters, Nov 15 '13 at 21:35
@MartijnPieters: The `titles` parameter in a MediaWiki API query [takes a `|`-separated list of exact page titles](https://www.mediawiki.org/wiki/API:Query#Specifying_pages). There's no `|` in the example query, so there will be at most one result. — Ilmari Karonen, Nov 15 '13 at 21:39
@IlmariKaronen: And you know that the code he wrote to generate that query is inserting a single `title` and not `'|'.join(titles)`? And that the same will be true for anyone who searches for this question in the future? And that none of them will ever want to use this code for something more general? The point remains that getting the "first value" out of a dictionary is a weird thing to do in general, and can be a problem with this particular API, and even if you know it won't be a problem with your current code, you should understand why your circumstances are special. — abarnert, Nov 15 '13 at 21:49

score 3 · Answer 1 · answered Nov 15 '13 at 22:57

One solution is to use the indexpageids parameter, e.g.: http://fr.wikipedia.org/w/api.php?action=query&titles=Vanit%C3%A9&prop=langlinks&lllimit=500&format=jsonfm&indexpageids. It will add an array of pageids to the response. You can then use that to access the dictionary.
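A minimal sketch of how the extra array could be used. It assumes the array appears as a `pageids` list of string ids under `query`; the sample response is trimmed from the question, and `langlinks_by_pageid` is a hypothetical helper name, not part of any library.

```python
# Trimmed sample of a response requested with indexpageids.
# Assumption: the ids arrive as strings in query["pageids"].
data = {
    "query": {
        "pageids": ["188791"],
        "pages": {
            "188791": {
                "pageid": 188791,
                "ns": 0,
                "title": "Vanit\u00e9",
                "langlinks": [
                    {"lang": "bg", "*": "Vanitas"},
                    {"lang": "ca", "*": "Vanitas"},
                ],
            }
        },
    }
}


def langlinks_by_pageid(response):
    """Yield (page id, langlinks) pairs in the order given by query["pageids"]."""
    pages = response["query"]["pages"]
    for pageid in response["query"]["pageids"]:
        # .get() guards against a page that carries no "langlinks" key.
        yield pageid, pages[pageid].get("langlinks", [])


result = list(langlinks_by_pageid(data))
print(result)
```

Because the ids come from the response itself, nothing has to be hard-coded or guessed from dictionary order.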

score 2 · Accepted Answer · edited May 23 '17 at 11:50

As long as you're only querying one page at a time, Simeon Visser's answer will work. However, as a matter of good style, I'd recommend structuring your code so that you iterate over all the returned results, even if you know there should be only one:

for page in data["query"]["pages"].values():
    title = page["title"]
    langlinks = page["langlinks"]
    # do something with langlinks...

In particular, by writing your code this way, if you ever find yourself needing to run the query for multiple pages, you can do it efficiently with a single MediaWiki API request.
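As a rough illustration, the same loop handles a response that contains several pages. The second page below is made up, and the sketch assumes that a page without language links may simply lack the `langlinks` key; `langlinks_by_title` is a hypothetical helper name.

```python
# Sample response with two pages; the second entry is invented for illustration.
data = {
    "query": {
        "pages": {
            "188791": {
                "pageid": 188791,
                "ns": 0,
                "title": "Vanit\u00e9",
                "langlinks": [
                    {"lang": "bg", "*": "Vanitas"},
                    {"lang": "ca", "*": "Vanitas"},
                ],
            },
            "12345": {"pageid": 12345, "ns": 0, "title": "Hypothetical page"},
        }
    }
}


def langlinks_by_title(response):
    """Map each page title to its list of langlinks (empty if the key is missing)."""
    result = {}
    for page in response["query"]["pages"].values():
        result[page["title"]] = page.get("langlinks", [])
    return result


links = langlinks_by_title(data)
for title, langlinks in links.items():
    print(title, [link["lang"] for link in langlinks])
```

Keying the result by title keeps track of which page each list of links came from.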

edited May 23 '17 at 11:50

Community


answered Nov 15 '13 at 21:51

Ilmari Karonen


@Oyabi: Ps. With no disrespect to abarnert intended, if you decided to use my solution, you might want to consider selecting it as the accepted one. Of course, if you don't, that's fine too -- it's your question and your choice. :) – Ilmari Karonen Nov 16 '13 at 23:01
Sorry, I'm not very used to this site ;) – Oyabi Nov 17 '13 at 09:26

score 1 · Answer 3 · answered Nov 15 '13 at 21:25

You're using Python 3 and values() now returns a dict_values instead of a list. This is a view on the values of the dictionary.

That's why you're getting the error: indexing works on a list but not on a view.

To fix it:

list(json["query"]["pages"].values())[0]["langlinks"]
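A small sketch of the difference on Python 3, using a trimmed sample dict named `pages` (an assumption, not the real response). It also shows `next(iter(...))`, a swapped-in alternative that fetches one value without building a list first.

```python
# Trimmed sample of the "pages" dict from the question.
pages = {
    "188791": {
        "title": "Vanit\u00e9",
        "langlinks": [{"lang": "bg", "*": "Vanitas"}, {"lang": "ca", "*": "Vanitas"}],
    }
}

values = pages.values()  # a dict_values view on Python 3, not a list

try:
    values[0]
    indexing_failed = False
except TypeError:
    # Views cannot be indexed, which is the error from the question.
    indexing_failed = True

# Option 1: copy the view into a list, then index it.
first_via_list = list(values)[0]["langlinks"]

# Option 2: take the first value from an iterator over the view.
first_via_iter = next(iter(values))["langlinks"]

print(indexing_failed, first_via_list == first_via_iter)
```

Both options pick whichever value comes first in the dict, so they are only meaningful when the dict is known to hold a single page.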

score 1 · Answer 4 · abarnert · 2013-11-18T18:28:33.450

If you really want just one page arbitrarily, do that the way Simeon Visser suggested.

But I suspect you want all langlinks in all pages, yes?

For that, you want a comprehension:

[page["langlinks"] for page in json["query"]["pages"].values()]

But of course that gives you a 2D list. If you want to iterate over each page's links, that's perfect. If you want to iterate over all of the langlinks at once, you want to flatten the list:

[langlink for page in json["query"]["pages"].values()
 for langlink in page["langlinks"]]

… or…

itertools.chain.from_iterable(page["langlinks"] 
                              for page in json["query"]["pages"].values())

(The latter gives you an iterator; if you need a list, wrap the whole thing in list. Conversely, for the first two, if you don't need a list, just any iterable, use parens instead of square brackets to get a generator expression.)
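For comparison, here is a runnable sketch of all three forms on made-up sample data. The dictionary is called `data` rather than `json` so the module name stays usable, and the second page is invented for illustration. In each form, `.values()` is called on the pages dict, since each page's `langlinks` is already a list.

```python
from itertools import chain

# Sample data: the first page is trimmed from the question, the second is invented.
data = {
    "query": {
        "pages": {
            "188791": {
                "title": "Vanit\u00e9",
                "langlinks": [
                    {"lang": "bg", "*": "Vanitas"},
                    {"lang": "ca", "*": "Vanitas"},
                ],
            },
            "12345": {
                "title": "Hypothetical page",
                "langlinks": [{"lang": "de", "*": "Hypothetische Seite"}],
            },
        }
    }
}
pages = data["query"]["pages"]

# One inner list per page (the "2D" result).
per_page = [page["langlinks"] for page in pages.values()]

# Flattened with a nested comprehension.
flat = [langlink for page in pages.values() for langlink in page["langlinks"]]

# Flattened with itertools; wrapped in list() because chain returns an iterator.
flat_chain = list(chain.from_iterable(page["langlinks"] for page in pages.values()))

print(len(per_page), len(flat), flat == flat_chain)
```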

edited Nov 18 '13 at 18:28

answered Nov 15 '13 at 21:31

abarnert


I tried your solution but I got this message: **TypeError: string indices must be integers**. My code is the following: `test= [langlink for page in json["query"]["pages"] for langlink in page["langlinks"]] print(test)`. I used Ilmari Karonen's solution, but I want to know what I'm doing wrong. Thank you ;) – Oyabi Nov 16 '13 at 09:51
@Oyabi: I think you need to add `.values()` after `json["query"]["pages"]`, otherwise it iterates over the keys, which are strings. But I doubt you really want a flattened list of langlinks from multiple pages anyway; it seems like kind of a useless thing to query for. If you're sure you'll never want to fetch langlinks for more than one page at a time, you can use Simeon Visser's solution (preferably with a comment saying that, yes, there can only ever be one element in the dict), otherwise use mine. – Ilmari Karonen Nov 16 '13 at 14:36
Yep, that works. Thank you very much; I understand list comprehensions better with your help. Have a nice day ;) – Oyabi Nov 16 '13 at 15:50
@IlmariKaronen: Thanks for catching the missing `values`. – abarnert Nov 18 '13 at 18:28
@IlmariKaronen: Meanwhile, as I explained in comments on the question, I don't think this is useless. If querying multiple pages were useless, the MediaWiki software wouldn't have a way to do it, and wouldn't return multiple pages in a response for even a single page. – abarnert Nov 18 '13 at 18:30
@abarnert: Querying multiple pages isn't useless. Merging together the langlinks lists for multiple pages, at least in my experience, is. (You're basically losing the context that tells you where the links are _from_.) I'm sure one could construct a use case for it, but I'm also at least 99% sure it's not what the OP actually wants. – Ilmari Karonen Nov 18 '13 at 22:33
@IlmariKaronen: Ah, I assumed you were making the same argument you made in the original comments to the question, not a different one that just sounds similar. Probably my fault for making that assumption and not reading carefully. – abarnert Nov 18 '13 at 22:37
