(First off, my apologies as this is a blatant cross-post. I thought opendata.SE would be the place for this, but it's gotten barely any views there and it appears to not be a very active site in general, so I figure I ought to try it here as it's programming-related.)

I'm trying to get a list of major cities in the world: their name, population, and location. I found what looked like a good query on Wikidata, slightly tweaking one of their built-in query examples:

SELECT DISTINCT ?cityLabel ?population ?gps WHERE {
  ?city (wdt:P31/wdt:P279*) wd:Q515.
  ?city wdt:P1082 ?population.
  ?city wdt:P625 ?gps.
  FILTER (?population >= 500000) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)

The results look good at first glance, but they are missing a lot of important cities. For example, San Francisco (population 800,000+) and Seattle (population 650,000+) are not in the list, even though I specifically asked for all cities with a population of at least 500,000.

Is there something wrong with my query? If not, there must be something wrong with the data Wikidata is using. Either way, how can I get a valid data set, with an API I can query from a Python script? (I've got the script all working for this; I'm just not getting back valid data.)

from SPARQLWrapper import SPARQLWrapper, JSON
from geopy.distance import great_circle

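def parseCoords(gps):
    # gps is a WKT literal such as "Point(<lon> <lat>)": strip "Point(" and ")"
    base = gps[6:-1]
    coords = base.split()
    # WKT lists longitude first; return (latitude, longitude)
    return (float(coords[1]), float(coords[0]))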

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""SELECT DISTINCT ?cityLabel ?population ?gps WHERE {
  ?city (wdt:P31/wdt:P279*) wd:Q515.
  ?city wdt:P1082 ?population.
  ?city wdt:P625 ?gps.
  FILTER (?population >= 500000) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)""")
queryResults = sparql.query().convert()
cities = [(city["cityLabel"]["value"], int(city["population"]["value"]), parseCoords(city["gps"]["value"])) for city in queryResults["results"]["bindings"]]
print (cities)
Stanislav Kralin
Mason Wheeler
  • How many results do you get? It might be that the endpoint has a default limit. On DBpedia, for instance, you get at most 10000 entries; for more you have to use OFFSET + LIMIT, i.e. pagination. – UninformedUser May 31 '16 at 18:13
  • @AKSW 250, nowhere near any reasonable default limit. – Mason Wheeler May 31 '16 at 18:52
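
For reference, the OFFSET + LIMIT pagination mentioned in the comment above could be sketched roughly as follows. This is only an illustration: `paged_queries`, `fetch_all`, and the `run_query` callable are hypothetical names, not part of SPARQLWrapper or any endpoint's API, and the base query is assumed to end with an ORDER BY so that pages come back in a stable order.

```python
from typing import Callable, Iterator, List


def paged_queries(base_query: str, page_size: int) -> Iterator[str]:
    """Yield base_query with successive LIMIT/OFFSET clauses appended."""
    offset = 0
    while True:
        yield f"{base_query}\nLIMIT {page_size} OFFSET {offset}"
        offset += page_size


def fetch_all(base_query: str, page_size: int,
              run_query: Callable[[str], List[dict]]) -> List[dict]:
    """Collect rows page by page until a page comes back short.

    run_query is a hypothetical callable that sends one SPARQL string to an
    endpoint and returns its list of result bindings.
    """
    rows: List[dict] = []
    for query in paged_queries(base_query, page_size):
        page = run_query(query)
        rows.extend(page)
        if len(page) < page_size:  # a short page means the end was reached
            break
    return rows
```

With the script from the question, `run_query` could wrap `sparql.setQuery(q)` followed by `sparql.query().convert()["results"]["bindings"]`. With only 250 rows coming back, though, a result limit is unlikely to be the cause here.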

1 Answer

The population of Seattle is simply not in this database.

If you execute:

#Largest cities of the world
#defaultView:BubbleChart
SELECT * WHERE {
 wd:Q5083 wdt:P1082 ?population.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

You get zero results. Although the item wd:Q5083 (Seattle) exists, it does not have a wdt:P1082 (population) statement.
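
The same check can be repeated from Python for any other city that seems to be missing. The following is a minimal sketch, assuming the `wd:` and `wdt:` prefixes are predefined by the endpoint (as they are on query.wikidata.org); `build_property_check` and `has_property` are hypothetical helper names introduced here for illustration.

```python
def build_property_check(qid: str, pid: str) -> str:
    """Return a SPARQL query listing the wdt: values of property pid on item qid."""
    return f"SELECT ?value WHERE {{ wd:{qid} wdt:{pid} ?value. }}"


def has_property(result: dict) -> bool:
    """True if a SPARQL JSON result contains at least one binding."""
    return len(result["results"]["bindings"]) > 0


# Usage with the SPARQLWrapper object from the question (needs network access):
#   sparql.setQuery(build_property_check("Q5083", "P1082"))
#   print(has_property(sparql.query().convert()))
```

An empty result means the filter on `?population` in the original query can never match that item, no matter what its class is.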

vds
  • Interesting. It appears that [San Francisco](https://m.wikidata.org/wiki/Q62) does have population data, though, and the link is identified with the tag P1082. (In fact, if you follow that link, San Francisco is used as the example!) Any idea why it's not showing up, then? – Mason Wheeler May 31 '16 at 13:59
  • ...of course. Why didn't I think of that? \*eyeroll\* – Mason Wheeler May 31 '16 at 14:28
  • Well, I noticed that there is an asterisk in your original query, `wdt:P279*`, which makes me think that San Francisco being a city-county may not be the issue: the asterisk should match everything that has a subclass-of path to city, which a city-county has... – vds May 31 '16 at 14:38
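
To make the property-path argument concrete, here is a small sketch of what `?city wdt:P31/wdt:P279* wd:Q515` asks for: an item matches if one of its instance-of classes reaches the target through zero or more subclass-of edges. The graph below is a toy with made-up names and edges, not real Wikidata data, and `superclasses` and `matches` are hypothetical helpers.

```python
from typing import Dict, Set


def superclasses(cls: str, subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """Return cls plus everything reachable via zero or more subclass-of edges (P279*)."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def matches(item: str, target: str,
            instance_of: Dict[str, Set[str]],
            subclass_of: Dict[str, Set[str]]) -> bool:
    """Emulate `?item wdt:P31/wdt:P279* target` on an in-memory graph."""
    return any(target in superclasses(cls, subclass_of)
               for cls in instance_of.get(item, set()))


# Toy graph with made-up edges, for illustration only (not real Wikidata data).
instance_of = {"town_a": {"city"}, "town_b": {"city_county"}, "town_c": {"village"}}
subclass_of = {"city_county": {"city"}, "city": {"settlement"}}

print(matches("town_b", "city", instance_of, subclass_of))  # True: city_county -> city
print(matches("town_c", "city", instance_of, subclass_of))  # False: no path to city
```

In this model an instance of a subclass is found only if the subclass-of chain to the target actually exists in the data, so a missing or differently modelled link would drop the item from the results.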