
I have a list of approximately 6k Wikidata instance IDs (beginning Q#####) for which I want to look up the human-readable labels. I am not too familiar with SPARQL, but by following some guidelines I have managed to put together a query that works for a single ID.

from SPARQLWrapper import SPARQLWrapper, JSON

query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wd: <http://www.wikidata.org/entity/>
    SELECT *
    WHERE {
            wd:Q##### rdfs:label ?label .
            FILTER (langMatches( lang(?label), "EN" ) )
          }
    LIMIT 1
    """

sparql = SPARQLWrapper("http://query.wikidata.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
output = sparql.query().convert()

I had hoped that iterating over a list of IDs would be as simple as putting the IDs in a dataframe and using the apply function...

ids_DF['label'] = ids_DF['instance_id'].apply(my_query_function)

... However, when I do that it errors out with an "HTTPError: Too Many Requests" error. Looking into the documentation, specifically the query limits section, it says the following:

Query limits

There is a hard query deadline configured which is set to 60 seconds. There are also following limits:

  • One client (user agent + IP) is allowed 60 seconds of processing time each 60 seconds

  • One client is allowed 30 error queries per minute

I'm unsure how to go about resolving this. Am I effectively running 6k error queries (I'm unsure what an error query even is)? If so, I presumably need to run them in batches to stay under the limit of 30 per minute.

My first attempt to resolve this has been to put a delay of 2 seconds after each query (see the third-from-last line below). I noticed that each instance ID was taking approximately 1 second to return a value, so my thinking was that the delay would bring the time per query to around 3 seconds, which should comfortably keep me within the limit. However, it still returns the same error. I've tried extending the sleep period as well, with the same results.

from SPARQLWrapper import SPARQLWrapper, JSON
import time

query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wd: <http://www.wikidata.org/entity/>
    SELECT *
    WHERE {
            wd:Q##### rdfs:label ?label .
            FILTER (langMatches( lang(?label), "EN" ) )
          }
    LIMIT 1
    """

sparql = SPARQLWrapper("http://query.wikidata.org/sparql")
sparql.setQuery(query)
time.sleep(2) # pause before the request goes out
sparql.setReturnFormat(JSON)
output = sparql.query().convert()

A similar question on this topic was asked here but I've not been able to follow the advice given.

cookie1986
  • Instead of doing 6k queries you should simply run just a few queries with the Wikidata entities provided in batches. The SPARQL `VALUES` clause is the way to go: maybe 60 queries with 100 entities provided in each query; you can try what works best. And don't forget to make it a POST request, as the query length might otherwise become too large for GET. To avoid cases with multiple English labels (are there any for rdfs:label?) you can work around this using the Wikidata-specific label service (a batching sketch follows below these comments). – UninformedUser Feb 22 '22 at 20:03
  • Your code isn't specifying a custom user agent (SPARQLWrapper's agent= constructor argument). Requests without a proper user agent – in violation of the user-agent policy (https://meta.wikimedia.org/wiki/User-Agent_policy) – may be throttled beyond the documented limits that you found (see the snippet below). – Lucas Werkmeister Mar 16 '22 at 16:31
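As a first, minimal change following the second comment, the user agent can be passed straight to the SPARQLWrapper constructor. The agent string here is only a placeholder and should identify your script plus a contact address, per the linked policy:

sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="my-label-lookup/0.1 (your-contact@example.com)",  # placeholder, per the user-agent policy
)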
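Building on the first comment, a minimal sketch of the batched VALUES approach might look like the following. The helper name fetch_labels, the batch size of 100, and the user-agent string are illustrative assumptions rather than anything from the question; the query groups the IDs into one VALUES block, uses the Wikidata label service to pick up English labels, and is sent as a POST request so the long value list doesn't overflow a GET URL.

from SPARQLWrapper import SPARQLWrapper, JSON, POST

def fetch_labels(instance_ids, batch_size=100):
    # Hypothetical helper: the name, batch size and agent string are illustrative assumptions.
    sparql = SPARQLWrapper(
        "https://query.wikidata.org/sparql",
        agent="label-lookup-script/0.1 (your-contact@example.com)",  # placeholder user agent
    )
    sparql.setMethod(POST)        # long VALUES lists can exceed GET URL length limits
    sparql.setReturnFormat(JSON)

    labels = {}
    for start in range(0, len(instance_ids), batch_size):
        batch = instance_ids[start:start + batch_size]
        values = " ".join("wd:" + qid for qid in batch)
        # wd:, wikibase: and bd: prefixes are predefined by the Wikidata endpoint
        sparql.setQuery("""
            SELECT ?item ?itemLabel
            WHERE {
              VALUES ?item { %s }
              SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
            }
        """ % values)
        results = sparql.query().convert()
        for row in results["results"]["bindings"]:
            qid = row["item"]["value"].rsplit("/", 1)[-1]  # strip the entity URI to get the Q-id
            labels[qid] = row["itemLabel"]["value"]
    return labels

Used with the dataframe from the question, this would reduce roughly 6k single-ID requests to about 60, which should sit comfortably within the documented limits:

label_map = fetch_labels(list(ids_DF['instance_id']))
ids_DF['label'] = ids_DF['instance_id'].map(label_map)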

0 Answers