4

I'm trying to do a fuzzy (i.e., partial or case-insensitive) entity label lookup in Wikidata with SPARQL (via the online endpoint). Unfortunately, the query returns a "QueryTimeoutException: Query deadline is expired." I'm assuming this is because the query produces too many results to run through the filter within Wikidata's 1-minute timeout.

Here's the specific query:

def findByFuzzyLabel(self, item_label):
    qstring = '''
        SELECT ?item WHERE {
            ?item rdfs:label ?label .
            FILTER( lcase(str(?label)) = "%s")
        }
        LIMIT 20
        ''' % (item_label)
    results = self.query(qstring)

Is there a way to do a partial-string and/or case-insensitive lookup on Wikidata's entity labels, or will I need to do this offline on a download of the raw data?

I'm looking to match labels such as "Lindbergh" to "Charles Lindbergh" and also to handle case insensitivity in some instances. Any suggestions on how to do this, whether via SPARQL or offline in Python, are appreciated.

bivouac0
  • 1
    Exact string matching like in your query, with a limit of 20, should not time out. However, the lower-case function may prevent the index from being used. Fuzzy matching usually needs a full-text index, which is not (yet) part of the SPARQL specification. As a non-fuzzy alternative, REGEX does allow string containment matching, but it is expensive and needs a full scan of the data. – UninformedUser Jul 09 '17 at 18:52
  • 1
    Another good option is to load the Wikidata dump into a triple store with full-text index support, or to do the indexing yourself, e.g. using Lucene. – UninformedUser Jul 09 '17 at 18:52
  • 1
    @AKSW You don't need REGEX for containment matching, though. [CONTAINS](https://www.w3.org/TR/sparql11-query/#func-contains) works just fine. :) – Joshua Taylor Jul 10 '17 at 17:48

3 Answers

5

You can now use the MediaWiki API directly from SPARQL, using a Wikidata magic service as documented here.

Example:

SELECT * WHERE {
  SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:api "EntitySearch" .
      bd:serviceParam wikibase:endpoint "www.wikidata.org" .
      bd:serviceParam mwapi:search "cheese" .
      bd:serviceParam mwapi:language "en" .
      ?item wikibase:apiOutputItem mwapi:item .
      ?num wikibase:apiOrdinal true .
  }
  ?item (wdt:P279|wdt:P31) ?type
} ORDER BY ASC(?num) LIMIT 20
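
To call this from Python, as in the question, the search term has to be placed into the query safely. The following is a minimal sketch, not a definitive implementation: the helper names (`sparql_string`, `build_entity_search_query`, `run_query`) and the User-Agent value are made up for illustration, the type pattern from the example is left out to keep it short, and it assumes the public endpoint at https://query.wikidata.org/sparql accepts `query` and `format=json` as URL parameters.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumed to accept GET requests).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def sparql_string(text):
    """Quote text as a SPARQL string literal, escaping backslashes, quotes and newlines."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def build_entity_search_query(term, language="en", limit=20):
    """Build the EntitySearch query from the answer for an arbitrary search term."""
    return """
SELECT ?item ?num WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "EntitySearch" .
    bd:serviceParam wikibase:endpoint "www.wikidata.org" .
    bd:serviceParam mwapi:search %s .
    bd:serviceParam mwapi:language %s .
    ?item wikibase:apiOutputItem mwapi:item .
    ?num wikibase:apiOrdinal true .
  }
} ORDER BY ASC(?num) LIMIT %d
""" % (sparql_string(term), sparql_string(language), int(limit))


def run_query(query):
    """Send the query to the endpoint and return the list of result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(url, headers={"User-Agent": "fuzzy-label-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Usage would be along the lines of `run_query(build_entity_search_query("Lindbergh"))`; each binding should then carry the matched entity under the `item` key.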
mhham
  • This is my go-to snippet for this purpose, and it works excellently for Lindbergh. I can't quite tell whether the difference from this answer is significant, so you may want to try both: https://w.wiki/3o53 – Matthias Winkelmann Aug 07 '21 at 10:09
4

Be more specific. Triplestores work with things, not with strings. For example, the following query works fine:

SELECT ?item WHERE {
    ?item wdt:P735 wd:Q2958359 .
    ?item rdfs:label ?label .
    FILTER (CONTAINS(LCASE(STR(?label)), "lindbergh"))
}
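
A detail that is easy to miss when building this query from Python: because the label is passed through LCASE, the search string must be lower-cased too, or nothing will match. A minimal sketch under stated assumptions follows; `sparql_literal` and `build_label_contains_query` are hypothetical helper names, the `wdt:P735` constraint and its default value are simply copied from the query above, and the `LIMIT` clause is an addition that is not part of the original query.

```python
import re


def sparql_literal(text):
    """Quote text as a SPARQL string literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def build_label_contains_query(needle, constraint_qid="Q2958359", limit=20):
    """Build a narrowed, case-insensitive containment query for labels."""
    # Reject anything that is not a plain item ID, so it cannot alter the query.
    if not re.fullmatch(r"Q[1-9][0-9]*", constraint_qid):
        raise ValueError("not a Wikidata item ID: %r" % constraint_qid)
    return """
SELECT ?item ?label WHERE {
    ?item wdt:P735 wd:%s .
    ?item rdfs:label ?label .
    FILTER (CONTAINS(LCASE(STR(?label)), %s))
}
LIMIT %d
""" % (constraint_qid, sparql_literal(needle.lower()), int(limit))
```

For example, `build_label_contains_query("Lindbergh")` produces the same filter as the query above, with the search string lower-cased to `"lindbergh"`.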

If it is not possible to be sufficiently specific, you need full-text search capabilities.

  • In fact, Blazegraph supports full-text search using the magic bds:search predicate, but this facility is not enabled on Wikidata.
  • Additionally, Blazegraph supports external full-text search using the magic fts:search predicate. The current implementation supports Apache Solr only. It might be relatively easy to add support for ElasticSearch, which Wikidata uses, but in any case this facility is not enabled.

There is a task to provide full-text search in the form of yet another Wikidata magic service, but this functionality is still not available on the public endpoint.

As a workaround, one can use SQL queries on Quarry. This is my query on Quarry:

USE wikidatawiki_p; 
DESCRIBE wb_terms;

SELECT CONCAT("Q", term_entity_id) AS wikidata_id, term_language, term_text, term_search_key
FROM wb_terms
WHERE term_type = 'label'
  AND term_search_key IN (LOWER('Lindbergh'), LOWER('Charles Lindbergh'));

The query time limit on Quarry is 30 minutes.
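
Once labels have been exported, for instance from a Quarry result or from a dump, the partial and case-insensitive matching asked about in the question takes only a few lines of Python. This is a rough sketch rather than a tuned solution: `find_labels` is a hypothetical helper, the rows are assumed to be `(entity_id, label)` pairs, and the IDs in the sample data are placeholders, not real Wikidata IDs.

```python
def find_labels(rows, needle):
    """Return the (entity_id, label) pairs whose label contains needle, ignoring case."""
    key = needle.casefold()  # casefold() is a more aggressive lower() for Unicode text
    return [(entity_id, label) for entity_id, label in rows if key in label.casefold()]


# Placeholder rows standing in for an exported label table.
sample_rows = [
    ("X1", "Charles Lindbergh"),
    ("X2", "Anne Morrow Lindbergh"),
    ("X3", "cheese"),
]

matches = find_labels(sample_rows, "LINDBERGH")
```

This is a linear scan, which is fine for a filtered export but slow for the full label set; for that, the full-text index suggested in the comments on the question is the better fit.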

Stanislav Kralin
2

You can do this online if you change your filter to use the "contains" function.

Example:

SELECT ?item WHERE {
    ?item rdfs:label ?label .
    FILTER( contains(lcase(?label), 'arles lin') )
}
LIMIT 20

Reference: contains is listed as one of the XPath functions you can use in SPARQL. See: https://www.w3.org/2009/sparql/wiki/Feature:FunctionLibrary#XQuery_1.0_and_XPath_2.0_Functions_and_Operators


Example 2: (with more triples to optimise results)

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?item ?label WHERE {
    ?item rdfs:label ?label .
    ?item rdf:type dbo:Person .   # Works with or without this too, also try skos:Concept
    FILTER( contains(lcase(?label), 'arles lin') && LANGMATCHES(LANG(?label), "en") )
}
LIMIT 20
Jang-Vijay Singh
  • 1
    This query is unfortunately still too expensive, it has to do a full scan of the data anyways. Only fulltext indexes are supposed to be efficient enough. – UninformedUser Jul 10 '17 at 12:12
  • Removing the str() typecast made it much faster than your original query. Your original question was about how to do a fuzzy, case-insensitive search anyway, and if you try the "contains" option on DBpedia, it works. It would be interesting to look at optimization options now. – Jang-Vijay Singh Jul 10 '17 at 12:18
  • It was not my query...and I got a timeout right now with your query. – UninformedUser Jul 10 '17 at 12:23
  • Moreover, `str` is not a typecast! It simply returns the *lexical form* of a literal. This is useful as this won't apply matching on the full literal string which additionally can consist of datatypes and/or language tags. – UninformedUser Jul 10 '17 at 12:23
  • Okay, maybe try a bit later. I added another example SPARQL query and a screenshot to my original answer; it works fine for me. It could be made faster still by adding more optional triples (maybe matching on Person, Category, etc.). – Jang-Vijay Singh Jul 10 '17 at 12:48
  • Ehm, sorry man. We're not talking about the DBpedia endpoint (backend Virtuoso)...it's [Wikidata](https://query.wikidata.org) (backend Blazegraph) ... – UninformedUser Jul 10 '17 at 14:08
  • Okay. I made some tries on Wikidata and did not get much further than Stanislav's answer above – Jang-Vijay Singh Jul 10 '17 at 16:40