Matching specific Geonames IDs with Wikidata IDs using Pywikibot

Question

I have an extensive list of Geonames IDs for which I want to find the matching Wikidata IDs. I would like to use Pywikibot and, if possible, iterate over the list.

The SPARQL query for an individual Geonames ID would be:

SELECT DISTINCT ?item ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
  {
    SELECT DISTINCT ?item WHERE {
      ?item p:P1566 ?statement0.
      ?statement0 (ps:P1566) "2867714".
    }
  }
}

2867714 is the Geonames ID for Munich, and running the query via the following script returns the correct Wikidata ID:

import pywikibot
from pywikibot import pagegenerators as pg

# read query file

with open('C:\\Users\\p70076654\\Downloads\\SPARQL_mapGeonamesID.rq', 'r') as query_file:
    QUERY = query_file.read()
    #print(QUERY)
    
# create generator based on query
# returns an iterator that produces a sequence of values when iterated over
# useful when creating large sequences of values

wikidata_site = pywikibot.Site("wikidata", "wikidata")
generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)

print(generator)

# OUTPUT: <generator object WikidataSPARQLPageGenerator.<locals>.<genexpr> at 0x00000169FAF3FD10>

# iterate over generator

for item in generator:
    print(item)

The correct output returned is: wikidata:Q32664319

Ideally, I want to replace the specific ID with a variable so that I can insert the IDs from my list one after another. I checked the Pywikibot documentation but could not find information on my specific use case. How can I replace the individual ID with a variable and iterate over my ID list?

logi-kal · Accepted Answer · 2023-07-25T20:37:06.040

First, why do you use a subquery? You can simplify the syntax to:

SELECT DISTINCT ?item ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
  ?item p:P1566/ps:P1566 "2867714".
}

Coming to your question, you can use Python's string interpolation (the % operator) to generalize your query:

SELECT DISTINCT ?item ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
  ?item p:P1566/ps:P1566 "%s".
}

and then instantiate it as QUERY % "2867714".
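As a minimal sketch of that instantiation step, the template can live in a string and be filled in by a small helper. The name build_query is hypothetical, and the digit check rests on the assumption that GeoNames IDs are plain numeric strings:

```python
# Query template with a "%s" placeholder for the GeoNames ID (P1566).
QUERY_TEMPLATE = """
SELECT DISTINCT ?item ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
  ?item p:P1566/ps:P1566 "%s".
}
"""


def build_query(geonames_id):
    """Hypothetical helper: fill the placeholder with one GeoNames ID."""
    geonames_id = str(geonames_id)
    # Assumption: GeoNames IDs are numeric. Rejecting anything else keeps
    # stray quotes or whitespace from the input list out of the query.
    if not (geonames_id.isascii() and geonames_id.isdigit()):
        raise ValueError(f"not a numeric GeoNames ID: {geonames_id!r}")
    return QUERY_TEMPLATE % geonames_id


# Show the instantiated query for the Munich ID from the question.
print(build_query("2867714"))
```

The returned string can be passed to pg.WikidataSPARQLPageGenerator in place of the QUERY read from the file.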

With a list of ids, it would be something like:

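import pywikibot
from pywikibot import pagegenerators as pg

# Read the query template (containing the "%s" placeholder) once
with open('C:\\Users\\p70076654\\Downloads\\SPARQL_mapGeonamesID.rq', 'r') as query_file:
    QUERY = query_file.read()

# Create the site object once, outside the loop
wikidata_site = pywikibot.Site("wikidata", "wikidata")

geonames_ids = ["2867714", "2867715", "2867716"]
for geonames_id in geonames_ids:
    generator = pg.WikidataSPARQLPageGenerator(QUERY % geonames_id, site=wikidata_site)
    ...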
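One possible way to fill in the loop body is to collect the results into a dictionary keyed by GeoNames ID. This is only a sketch: the function names map_geonames_to_wikidata and run_with_pywikibot are hypothetical, and the query runner is passed in as a parameter so the mapping logic can be tried without a network connection.

```python
def map_geonames_to_wikidata(geonames_ids, query_template, run_query):
    """Return {geonames_id: [wikidata_id, ...]} for every ID in the list.

    query_template must contain one "%s" placeholder; run_query is any
    callable that takes a SPARQL string and returns an iterable of item IDs.
    """
    mapping = {}
    for geonames_id in geonames_ids:
        mapping[geonames_id] = list(run_query(query_template % geonames_id))
    return mapping


def run_with_pywikibot(query):
    """Run one query through Pywikibot and yield the matching item IDs."""
    # Imported here so the mapping function above works without Pywikibot.
    import pywikibot
    from pywikibot import pagegenerators as pg

    site = pywikibot.Site("wikidata", "wikidata")
    for item in pg.WikidataSPARQLPageGenerator(query, site=site):
        # For a Wikidata item page, title() is the Q-identifier.
        yield item.title()
```

With QUERY and geonames_ids defined as above, it could be called as map_geonames_to_wikidata(geonames_ids, QUERY, run_with_pywikibot). IDs without a match map to an empty list, which makes gaps easy to spot when the results are checked by hand.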

edited Jul 25 '23 at 20:37

answered Jul 25 '23 at 12:40


This can be much more efficient by only reading the query once (or just including it in a string variable, which would have the advantage of being more readable). Also, labels are expensive to look up. If they aren't being used, which they don't appear to be, they should be left out of the SPARQL query. – Tom Morris Jul 25 '23 at 20:26
@TomMorris You're right, there was no point in reading the query multiple times. As for the labels, I don't know what the OP is doing with them, and it is not relevant to the question. – logi-kal Jul 25 '23 at 20:36
I am using the labels to make all output "human-readable" as intermediate results are sent to colleagues for data checks. – OnceUponATime Jul 25 '23 at 23:46
Got it. I got misled by the "correct output" description in the original question. – Tom Morris Jul 26 '23 at 15:54
