
Why does agrep find a match even though I restrict max.distance to zero? adist correctly tells me that I need two insertions...

> agrep("ab", "abcd", max = list(del = 0, ins = 0, sub = 0), value = T)
[1] "abcd"
> drop(attr(adist("ab", "abcd", counts = TRUE), "counts"))
ins del sub 
  2   0   0 

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Austria.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] formatR_0.6   vegan_2.0-4   permute_0.7-0

loaded via a namespace (and not attached):
[1] grid_2.15.1    lattice_0.20-6 tools_2.15.1 
Kay
  • The Note section in `?agrep` would seem to apply to you here. – joran Sep 27 '12 at 20:35
  • @joran I read the note but can't see why it should apply here.. – Kay Sep 27 '12 at 22:37
  • The key phrase is that `agrep` matches the pattern **within** each string in x, not to the whole string. Subtle (which is probably why they added the note). – joran Sep 27 '12 at 23:25

1 Answer

Notice that "ab" matches perfectly (no insertions needed!) with the first two characters of "abcd". That is what the Note section that @joran referenced is telling you.

# "ab" matches a substring of "abcd" exactly (its first two characters),
# so we get a match
agrep("ab", "abcd", value = TRUE)
#[1] "abcd"

# We only need 1 insertion to turn "ac" into "abc", a substring of "abcd",
# and we set max.distance = 1, so we get a match
agrep("ac", "abcd", max.distance = 1, value = TRUE)
#[1] "abcd"

# "ac" doesn't exactly match any substring of "abcd" and we set
# max.distance = 0, so no match
agrep("ac", "abcd", max.distance = 0, value = TRUE)
#character(0)
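
To make the substring-versus-whole-string distinction concrete, here is a minimal sketch using `adist`, whose `partial` argument switches between the two notions of distance. The helper `whole_match` is a hypothetical name introduced only for illustration; it is not part of base R.

```r
# Whole-string distance: "ab" has to be turned into all of "abcd".
whole <- as.vector(adist("ab", "abcd"))

# Substring distance (the notion agrep uses): "ab" only has to be
# turned into some part of "abcd".
part <- as.vector(adist("ab", "abcd", partial = TRUE))

# Hypothetical helper (not in base R): keep the elements of x whose
# whole-string distance to pattern is at most max_dist.
whole_match <- function(pattern, x, max_dist = 0) {
  x[as.vector(adist(pattern, x)) <= max_dist]
}

whole_match("ab", c("ab", "abcd", "abc"), max_dist = 1)
```

With `max_dist = 0` the helper reduces to exact matching, in which case a plain `==` comparison would probably be simpler.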
Dason
  • +1 - nice explanation. `agrep` has always been a bit of a mysterious creature to me. – thelatemail Sep 27 '12 at 23:52
  • I already explained why you're getting the result you're getting (it's by design). If you want something else you could just use `adist`. But really it sounds like you're looking for exact matches in which case `agrep` and `adist` really aren't the functions you should be using... – Dason Sep 28 '12 at 17:31
  • @dason, I deleted my comment. Your explanation was clear. However, I'm rather unhappy with the design of agrep (matching of substrings), i.e. I dislike the fact that you get different results from `agrep("abcd", "ab")` and `agrep("ab", "abcd")`. – Kay Sep 29 '12 at 08:48
  • You're allowed to be unhappy with it, but that doesn't change the design. I personally think it makes a lot of sense. Say I have a bunch of sentences (each sentence being a string in a character vector) that I need to look through, and I want to scan for "Dason", but I'm afraid some people have made data-entry errors, so I want to look for things that might be close to "Dason" as well. `agrep` makes this task relatively easy. Personally, it just sounds like you should be using `adist` directly. – Dason Sep 29 '12 at 16:03
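
The sentence-scanning use case from the last comment can be sketched as follows. The sentences are made-up illustration data, and `max.distance = 1` is an assumed tolerance, not a recommendation.

```r
# Made-up example sentences; the first contains a misspelling of "Dason".
sentences <- c("I asked Dasn about agrep",
               "Dason wrote the answer",
               "xyz")

# agrep looks for "Dason" *within* each sentence, allowing one edit,
# so the surrounding words do not count against the match.
hits <- agrep("Dason", sentences, max.distance = 1, value = TRUE)
hits
```

Because the match is against substrings, the rest of each sentence is ignored, which is what makes this convenient for scanning free text.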