0

I am trying to get a city list together with region and country information with a query like this:

# get a list of cities
# for geograpy3 library
# see https://github.com/somnathrakshit/geograpy3/issues/15
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
# get human settlements
SELECT DISTINCT ?city ?cityLabel (max(?cityPop) as ?cityPopulation) ?coord ?region ?regionLabel ?regionIsoCode ?country ?countryLabel ?countryIsoCode ?countryPopulation ?countryGdpPerCapita WHERE {
  # if you uncomment this line this query might run for some 3 hours on a local wikidata copy using Apache Jena
  # run for Vienna, Illinois, Vienna Austria, Paris Texas and Paris France as example only
  # VALUES ?city { wd:Q577544 wd:Q1741 wd:Q830149 wd:Q90}.
  # run for Andorra
  VALUES ?country {wd:Q228}.
  # instance of human settlement https://www.wikidata.org/wiki/Q486972
  ?city wdt:P31/wdt:P279* wd:Q486972 .
  # label of the City
  ?city rdfs:label ?cityLabel filter (lang(?cityLabel) = "en").
  # country this city belongs to
  ?city wdt:P17 ?country .
  # label for the country
  ?country rdfs:label ?countryLabel filter (lang(?countryLabel) = "en").
  # https://www.wikidata.org/wiki/Property:P297 ISO 3166-1 alpha-2 code
  ?country wdt:P297 ?countryIsoCode.
  # population of country
  ?country wdt:P1082 ?countryPopulation.
  OPTIONAL {
     ?country wdt:P2132 ?countryGdpPerCapita.
  }
  OPTIONAL {
     # located in administrative territory
     # https://www.wikidata.org/wiki/Property:P131
     ?city wdt:P131* ?region.
     # administrative unit of first order
     ?region wdt:P31/wdt:P279* wd:Q10864048.
     ?region rdfs:label ?regionLabel filter (lang(?regionLabel) = "en").
     # isocode state/province
     OPTIONAL { ?region wdt:P300 ?regionIsoCode. }
  }
  # population of city
  OPTIONAL { ?city wdt:P1082 ?cityPop.}
   # get the coordinates
  OPTIONAL { ?city wdt:P625 ?coord. }
} GROUP BY  ?city ?cityLabel  ?coord ?region ?regionLabel ?regionIsoCode ?country ?countryLabel ?countryIsoCode ?countryPopulation ?countryGdpPerCapita
ORDER BY ?cityLabel

try it! to experiment with the query i comment out the

  # VALUES ?city { wd:Q577544 wd:Q1741 wd:Q830149 wd:Q90}.
  # run for Andorra
  VALUES ?country {wd:Q228}.

part to see that the results make sense.

Now for The Andorra trial there are cities with multiple coordinates:

https://www.wikidata.org/wiki/Property:P625 Which are event flagged as a problem. Wikidata Query Screenshot

I know there is work-around as explained in How to get only the most recent value from a Wikidata property? and https://w.wiki/EKB

I tried the approach in the snippet

?city p:P1082 ?populationStatement . 
  ?populationStatement ps:P1082 ?cityPopulation.
  ?populationStatement pq:P585 ?date
  FILTER NOT EXISTS { ?city p:P1082/pq:P585 ?date_ . FILTER (?date_ > ?date) } 

which makes queries real slow and in this case i am looking into all instance of human settlement which are a few hundred thousand. Even on my local wikidata copy this runs more than 3 hours !

So i wonder whether there is an alternative with MAX, AVG, Subqueries with limit or the like or any other nifty idea that would solve the issue with a decent performance?

Wolfgang Fahl
  • 15,016
  • 11
  • 93
  • 186

1 Answers1

1

You can use sample() as aggregation function (sparql doc).

Starting from you query expression, you will need to change the first line into

SELECT DISTINCT ?city ?cityLabel (max(?cityPop) as ?cityPopulation) (sample(?coord) as ?coordinate) ?region ?regionLabel ?regionIsoCode ?country ?countryLabel ?countryIsoCode ?countryPopulation ?countryGdpPerCapita WHERE {

and your second last line into:

} GROUP BY  ?city ?cityLabel ?region ?regionLabel ?regionIsoCode ?country ?countryLabel ?countryIsoCode ?countryPopulation ?countryGdpPerCapita

The result should look like this: https://w.wiki/dRV.

The work-around you tried does not work because unlike P1082 (population), P625 (coordinate) have in most cases no P585 (point in time) qualifier.

Pascalco
  • 2,481
  • 2
  • 15
  • 31