Query for best match to a string with SPARQL?

Question

I have a list of movie titles and want to look them up in DBpedia for meta information like "director". But I have trouble identifying the correct movie with SPARQL, because the titles sometimes don't match exactly.

How can I get the best match for a movie title from DBpedia using SPARQL?

Some problematic examples:

My List: "Die Hard: with a Vengeance" vs. DBpedia: "Die Hard with a Vengeance"
My List: "Hachi" vs. DBpedia: "Hachi: A Dog's Tale"

My current approach is to query the DBpedia endpoint for all movies, filter by checking for single tokens (without punctuation), order by title, and return the first result. E.g.:

SELECT ?resource ?title ?director WHERE {
   ?resource foaf:name ?title .
   ?resource rdf:type schema:Movie .
   ?resource dbo:director ?director .
   FILTER (
      contains(lcase(str(?title)), "die") && 
      contains(lcase(str(?title)),"hard")
   )
}
ORDER BY (?title)
LIMIT 1

This approach is very slow and also sometimes fails, e.g.:

SELECT ?resource ?title ?director WHERE {
   ?resource foaf:name ?title .
   ?resource rdf:type schema:Movie .
   ?resource dbo:director ?director .
   FILTER (
      contains(lcase(str(?title)), "hachi") 
   )
}
ORDER BY (?title)
LIMIT 10

where the correct result is in second place:

  resource                                          title                        director
  http://dbpedia.org/resource/Chachi_420            "Chachi 420"@en              http://dbpedia.org/resource/Kamal_Haasan
  http://dbpedia.org/resource/Hachi:_A_Dog's_Tale   "Hachi: A Dog's Tale"@en     http://dbpedia.org/resource/Lasse_Hallström    
  http://dbpedia.org/resource/Hachiko_Monogatari    "Hachikō Monogatari"@en      http://dbpedia.org/resource/Seijirō_Kōyama
  http://dbpedia.org/resource/Thachiledathu_Chundan "Thachiledathu Chundan"@en   http://dbpedia.org/resource/Shajoon_Kariyal

Any ideas how to solve this problem? Or even better: How to query for best matches to a string with SPARQL in general?

Thanks!

SPARQL endpoints are not text search engines, so there is only limited support for string matching in the SPARQL standards. Some triple stores have extended support, depending on the underlying implementation. E.g. some triple stores use Lucene for text search, while others like Virtuoso have built-in functions. — UninformedUser, Jul 30 '16 at 07:46
The DBpedia endpoint uses Virtuoso, so you could have a look at http://docs.openlinksw.com/virtuoso/rdfsparqlrulefulltext/ . E.g. `bif:contains` is much faster on indexed literals than regular REGEX. An example from the docs is `?s foaf:Name ?name . ?name bif:contains "'rich*'".` which would match all subjects whose `foaf:Name` contains a word starting with "rich", such as Richard or Richie. — UninformedUser, Jul 30 '16 at 07:49
@AKSW Thanks for the hint with bif:contains. I will take a look at that. — dynobo, Jul 30 '16 at 07:55
Have a look at http://stackoverflow.com/questions/24557020/to-use-isparql-to-compare-values-using-similarity-measures. As mentioned, SPARQL isn't really meant for string processing, but it can still do a lot, even if it won't be super performant. That link shows how you can compute some edit distances with SPARQL. — Joshua Taylor, Jul 30 '16 at 21:23
@JoshuaTaylor Thanks for the link! I tried that approach and came up with a pretty good working solution (see my answer). — dynobo, Jul 31 '16 at 14:19
@AKSW Can Lucene be added to Fuseki? I found [this](https://users.jena.apache.narkive.com/TXOQYQ8x/configuring-fuseki-with-both-lucene-and-reasoning), but it seems to be about Fuseki's Jena API. Of course, it won't be very useful on the standalone server, but it's good for testing purposes. — RFAI, Jul 01 '19 at 12:24
@RFNO https://jena.apache.org/documentation/query/text-query.html#working-with-fuseki — UninformedUser, Jul 01 '19 at 12:33
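
The `bif:contains` suggestion from the comments could be sketched as below for the "Die Hard" case. This is a minimal sketch, not a tested solution: it assumes the Virtuoso-backed DBpedia endpoint with its predefined prefixes (as in the question's queries), and ordering by title length is an illustrative assumption rather than anything the comments propose.

```sparql
SELECT ?resource ?title ?director WHERE {
   ?resource rdf:type schema:Movie .
   ?resource foaf:name ?title .
   ?resource dbo:director ?director .
   # Virtuoso-specific full-text predicate: both words must occur in the title.
   # It matches whole words from the text index, not arbitrary substrings.
   ?title bif:contains "'die' AND 'hard'" .
}
# Assumption for illustration: prefer the shortest matching title.
ORDER BY ASC(strlen(str(?title)))
LIMIT 10
```

Because the match is word-based, a search for 'hachi' should no longer return substring hits such as "Chachi 420", though it would also miss "Hachikō Monogatari" unless a wildcard is used.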

score 2 · Answer 1 · edited May 23 '17 at 12:10

I adapted the regex approach mentioned in the comments and came up with a solution that works pretty well, better than anything I could get with bif:contains:

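   SELECT ?resource ?title ?match (strlen(str(?title)) AS ?lenTitle) (strlen(str(?match)) AS ?lenMatch)

   WHERE {
      ?resource foaf:name ?title .
      ?resource rdf:type schema:Movie .
      ?resource dbo:director ?director .
      # Prefix the title with a dummy 'x' (consumed again by ^x), then keep only the search
      # tokens found in order: "die" directly at the start, "hard" and "with" anywhere after.
      bind( replace(LCASE(CONCAT('x',?title)), "^x(die)*(?:.*?(hard))*(?:.*?(with))*.*$", "$1$2$3") as ?match )
   }

   # Rank by most matched characters first, then by shortest title.
   ORDER BY DESC(?lenMatch) ASC(?lenTitle)

   LIMIT 5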

It's not perfect, so I'm still open for suggestions.

edited May 23 '17 at 12:10 by Community · answered Jul 31 '16 at 14:18 by dynobo

Can you explain what each part is doing? I want to be able to search for "Die_Hard" while ignoring the _ (underline) and making it case insensitive. I searched with your code and it gave me too many hits! – RFAI Jul 08 '19 at 06:01
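
For the case raised in this comment (searching for "Die_Hard" while ignoring the underscore and letter case), one possible approach is to normalize the search term inside the query before comparing. The following is a sketch only, using standard SPARQL 1.1 string functions; the variable name `?needle` and the shortest-title ordering are illustrative assumptions, and it relies on the endpoint's predefined prefixes as the queries above do.

```sparql
SELECT ?resource ?title ?director WHERE {
   ?resource rdf:type schema:Movie .
   ?resource foaf:name ?title .
   ?resource dbo:director ?director .
   # Normalize the search term: underscores become spaces, then lowercase.
   BIND (lcase(replace("Die_Hard", "_", " ")) AS ?needle)
   # Compare against the lowercased title so the match is case-insensitive.
   FILTER (contains(lcase(str(?title)), ?needle))
}
# Assumption for illustration: shorter titles are closer to the search term.
ORDER BY ASC(strlen(str(?title)))
LIMIT 10
```

Like the `contains` filter in the question, this scans every movie title, so it is likely to be slow on a large endpoint.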
