To use iSPARQL to compare values using similarity measures

Question

I have a question for you.

I would like to write a query that retrieves the values that are similar (given a function of similarity, such as Lev) to a given string "Londn" to make the comparison with the predicate "RDFS:label" of DBPedia. In Output, for example, I would like to get the value of "London". I have read that a usable approach might be to use iSPARQL ("Imprecise SPARQL") although it is not very widely used in the literature.

Can I use iSPARQL or is there some SPARQL approach to perform the same operations?

Joshua Taylor · Accepted Answer · 2014-07-03T17:40:57.177

Short Version — You can do some of this in pure SPARQL

You can use a query like this to find cities whose names are like "Londn", and order them by (one measure of) similarity. The rest of the answer explains how this works:

select ?city ?percent where {
  ?city a dbpedia-owl:City ;
        rdfs:label ?label .
  filter langMatches( lang(?label), 'en' )

  bind( replace( concat( 'x', str(?label) ), "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$", '$1$2$3$4$5' ) as ?match )
  bind( xsd:float(strlen(?match))/strlen(str(?label)) as ?percent )
}
order by desc(?percent)
limit 100

SPARQL results

city                                  percent
----------------------------------------------
http://dbpedia.org/resource/London    0.833333
http://dbpedia.org/resource/Bonn      0.75
http://dbpedia.org/resource/Loudi     0.6
http://dbpedia.org/resource/Ladnu     0.6
http://dbpedia.org/resource/Lonar     0.6
http://dbpedia.org/resource/Longnan   0.571429
http://dbpedia.org/resource/Longyan   0.571429
http://dbpedia.org/resource/Luoding   0.571429
http://dbpedia.org/resource/Lodhran   0.571429
http://dbpedia.org/resource/Lom%C3%A9 0.5
http://dbpedia.org/resource/Andong    0.5

Computing string similarity metrics

Note: the code in this part of this answer works in Apache Jena. There's actually an edge case that causes this to (correctly) fail in Virtuoso. The update at the end addresses this issue.

There's nothing built into SPARQL for computing string matching distances, but you can do some of it using the regular expression replacement mechanism in SPARQL. Suppose you want to match the sequence "cat" in some strings. Then you can use a query like this to figure out how much of a given string is present in the sequence "cat":

select ?string ?match where {
  values ?string { "cart" "concatenate" "hat" "pot" "hop" }
  bind( replace( ?string, "^[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$", "$1$2$3" ) as ?match )
}

-------------------------
| string        | match |
=========================
| "cart"        | "cat" |
| "concatenate" | "cat" |
| "hat"         | "at"  |
| "pot"         | "t"   |
| "hop"         | ""    |
-------------------------

By examining the length of string and match, you should be able to compute some various similarity metrics. As a more complicated example that uses the "Londn" input you mentioned. The percent column is the percent of the string that matched the input.

select ?input
       ?string
       (strlen(?match)/strlen(?string) as ?percent)
where {
  values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                   "concatenate" "catnap" "hat" "cat" "chat" "chart" "port" "part" }

  values (?input ?pattern ?replacement) {
    ("cat"   "^[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$"                              "$1$2$3")
    ("Londn" "^[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
  }

  bind( replace( ?string, ?pattern, ?replacement) as ?match )
}
order by ?pattern desc(?percent)

--------------------------------------------------------
| input   | string        | percent                    |
========================================================
| "Londn" | "Londn"       | 1.0                        |
| "Londn" | "London"      | 0.833333333333333333333333 |
| "Londn" | "Lando"       | 0.6                        |
| "Londn" | "London Fog"  | 0.5                        |
| "Londn" | "Land Ho!"    | 0.375                      |
| "Londn" | "concatenate" | 0.272727272727272727272727 |
| "Londn" | "port"        | 0.25                       |
| "Londn" | "catnap"      | 0.166666666666666666666666 |
| "Londn" | "cat"         | 0.0                        |
| "Londn" | "chart"       | 0.0                        |
| "Londn" | "chat"        | 0.0                        |
| "Londn" | "hat"         | 0.0                        |
| "Londn" | "part"        | 0.0                        |
| "cat"   | "cat"         | 1.0                        |
| "cat"   | "chat"        | 0.75                       |
| "cat"   | "hat"         | 0.666666666666666666666666 |
| "cat"   | "chart"       | 0.6                        |
| "cat"   | "part"        | 0.5                        |
| "cat"   | "catnap"      | 0.5                        |
| "cat"   | "concatenate" | 0.272727272727272727272727 |
| "cat"   | "port"        | 0.25                       |
| "cat"   | "Lando"       | 0.2                        |
| "cat"   | "Land Ho!"    | 0.125                      |
| "cat"   | "Londn"       | 0.0                        |
| "cat"   | "London"      | 0.0                        |
| "cat"   | "London Fog"  | 0.0                        |
--------------------------------------------------------

Update

The code above works in Apache Jena, but fails in Virtuoso, due to the fact that the pattern can match the empty string. E.g., if you try the following query on DBpedia's endpoint (which is powered by Virtuoso), you'll get the following error:

select (replace( "foo", ".*", "x" ) as ?bar) where {}

Virtuoso 22023 Error The regex-based XPATH/XQuery/SPARQL replace() function can not search for a pattern that can be found even in an empty string

This surprised me, but the spec for replace says that it's based on XPath fn:replace. The docs for fn:replace say:

An error is raised [err:FORX0003] if the pattern matches a zero-length string, that is, if the expression fn:matches("", $pattern, $flags) returns true. It is not an error, however, if a captured substring is zero-length.

We can get around this problem, though, by adding a character to the beginning of both the pattern and the string:

select ?input
       ?string
       (strlen(?match)/strlen(?string) as ?percent)
where {
  values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                   "concatenate" "catnap" "hat" "cat" "chat" "chart" "port" "part" }

  values (?input ?pattern ?replacement) {
    ("cat"   "^x[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$"                              "$1$2$3")
    ("Londn" "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
  }

  bind( replace( concat('x',?string), ?pattern, ?replacement) as ?match )
}
order by ?pattern desc(?percent)

--------------------------------------------------------
| input   | string        | percent                    |
========================================================
| "Londn" | "Londn"       | 1.0                        |
| "Londn" | "London"      | 0.833333333333333333333333 |
| "Londn" | "Lando"       | 0.6                        |
| "Londn" | "London Fog"  | 0.5                        |
| "Londn" | "Land Ho!"    | 0.375                      |
| "Londn" | "concatenate" | 0.272727272727272727272727 |
| "Londn" | "port"        | 0.25                       |
| "Londn" | "catnap"      | 0.166666666666666666666666 |
| "Londn" | "cat"         | 0.0                        |
| "Londn" | "chart"       | 0.0                        |
| "Londn" | "chat"        | 0.0                        |
| "Londn" | "hat"         | 0.0                        |
| "Londn" | "part"        | 0.0                        |
| "cat"   | "cat"         | 1.0                        |
| "cat"   | "chat"        | 0.75                       |
| "cat"   | "hat"         | 0.666666666666666666666666 |
| "cat"   | "chart"       | 0.6                        |
| "cat"   | "part"        | 0.5                        |
| "cat"   | "catnap"      | 0.5                        |
| "cat"   | "concatenate" | 0.272727272727272727272727 |
| "cat"   | "port"        | 0.25                       |
| "cat"   | "Lando"       | 0.2                        |
| "cat"   | "Land Ho!"    | 0.125                      |
| "cat"   | "Londn"       | 0.0                        |
| "cat"   | "London"      | 0.0                        |
| "cat"   | "London Fog"  | 0.0                        |
--------------------------------------------------------

Thanks Joshua. Your answer is very attractive. Your proposed method takes into consideration that I know in advance the string and the length of it. If we suppose that we have a string "calculated" at runtime in the code, it becomes complicated to reuse this query, or am I wrong? — Musich87, Jul 04 '14 at 07:00
If you get the input at runtime, it shouldn't be too hard to programmatically constuct the pattern and replacement string from it. Depending on how you're doing this in code, it shouldn't be too hard to inject those into your query. — Joshua Taylor, Jul 04 '14 at 14:04
Hi Joshua, I'm reading your answer and I have noticed that if I run your SPARQL query (the first), I don't get London in the response. Why? — Musich87, Sep 22 '14 at 11:06
@Musich87 The query still works correctly, but it asks for things that have type dbpedia-owl:City. If you run the query `select ?type { dbpedia:London a ?type }`, you'll see that dbpedia:London apparently no longer has type dbpedia-owl:City. I have no idea why that would be the case, but that's why it doesn't show up in the query results. That's part of the danger of using a live dataset; the data can change while we're not looking. You can change dbpedia-owl:CIty to dbpedia-owl:Settlement and you'll see London again. — Joshua Taylor, Sep 22 '14 at 11:35
ok, I have understood. I didn't see that dbpedia:London doesn't have type dbpedia-owl:City. Thanks. — Musich87, Sep 22 '14 at 12:00

To use iSPARQL to compare values using similarity measures

1 Answers1

Short Version — You can do some of this in pure SPARQL

Computing string similarity metrics

Update

Linked