
I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q.

For these two strings, the qgrams function gives the result I expect:

> qgrams("a", "the cat sat on the mat", q = 2)
   th he t  sa on n  ma e   c ca at  s  t  o  m
V1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
V2  2  2  2  1  1  1  1  2  1  1  3  1  1  1  1

The stringdist function returns:

> stringdist("a", "the cat sat on the mat", q = 2, method = "qgram")
[1] Inf

Instead of returning:

> sum(qgrams("a", "the cat sat on the mat", q = 2)[2,])
[1] 21

Did I miss something or is this a bug? Thanks.

stringdist versions: 0.9.4.1 and 0.9.4.2

Giora Simchoni

1 Answer


Currently stringdist::stringdist assumes an undefined (Inf) distance when q is larger than the string length.

My reasoning at the time was probably that the map from {the set of all strings over an alphabet Sigma} to {nonnegative integer vectors of length |Sigma|^q} has no explicit definition if q is larger than the input string length. This is also how I wrote it down in the stringdist paper.

qgrams maps such cases to the 0-vector, which is indeed inconsistent.

If I take the definition in the paper of Ukkonen (1992), mapping to the 0-vector is indeed the right choice, which implies a bug in stringdist.
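To make that reading concrete, here is a minimal base R sketch of the q-gram distance in which a string shorter than q simply has no q-grams (the 0-vector). It is only an illustration, not the code inside stringdist; the helper names qgram_profile and qgram_distance are made up for this example, and it assumes q >= 1.

```r
# Hypothetical helper: named counts of all q-grams in a single string.
# A string shorter than q has no q-grams, so an empty vector is returned,
# which plays the role of the 0-vector.
qgram_profile <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(integer(0))
  starts <- seq_len(n - q + 1)
  grams <- substring(s, starts, starts + q - 1)
  counts <- table(grams)
  setNames(as.integer(counts), names(counts))
}

# Hypothetical helper: sum of absolute differences between the two
# q-gram count vectors, taken over every q-gram seen in either string.
qgram_distance <- function(a, b, q) {
  pa <- qgram_profile(a, q)
  pb <- qgram_profile(b, q)
  keys <- union(names(pa), names(pb))
  count_for <- function(p) {
    v <- p[keys]        # q-grams absent from p come back as NA
    v[is.na(v)] <- 0L   # and are counted as zero occurrences
    unname(v)
  }
  sum(abs(count_for(pa) - count_for(pb)))
}

qgram_distance("a", "the cat sat on the mat", q = 2)
```

Under this definition the call above gives 21, matching the sum over the second row of the qgrams output in the question, rather than Inf.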

Will fix.