
This seems really simple, but I don't understand how agrep fuzzy matching handles substitutions. Two substitutions produce a match, as expected, when all=2 is specified, but not when sub=2 is specified. Why is this?

# Finds a match, as expected
agrep("abcdeX", "abcdef", value = TRUE,
      max.distance = list(sub=1, ins=0, del=0))
#> [1] "abcdef"


# Finds no match, as expected
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(sub=1, ins=0, del=0))
#> character(0)


# Finds a match, as expected
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(all=2))
#> [1] "abcdef"


# UNEXPECTEDLY finds no match
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(sub=2, ins=0, del=0))
#> character(0)

Created on 2021-06-03 by the reprex package (v2.0.0)

Atakan

1 Answer


all is an upper limit on the total number of edits. It always applies, regardless of the other max.distance controls (other than cost), and it defaults to 10% of the pattern length.

# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
     max.distance = list(sub = 2, ins = 0, del = 0, all = 0.1))
# character(0)

# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
     max.distance = list(sub = 2, ins = 0, del = 0, all = 0.2))
# [1] "abcdef"

# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
    max.distance = list(sub = 1, ins = 1, del = 0, all = 0.1))
# character(0)

# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
    max.distance = list(sub = 1, ins = 1, del = 0, all = 0.2))
# [1] "abcdef"

There's a bit of a gotcha: below 1, all is read as a fraction of the pattern length, but at 1 it switches to an integer count of edits.

# all allows up to 8 edits, so ins = 2 is the binding limit
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE,
    max.distance = list(sub = 0, ins = 2, del = 0, all = 1 - 1e-9))
# [1] "abcdef"

# all = 1 is an integer count: only 1 edit allowed in total
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE,
    max.distance = list(sub = 0, ins = 2, del = 0, all = 1))
# character(0)

When you effectively disable all by setting it to just less than 1, only the limits set by sub, ins and del apply.

# two substitutions allowed
agrep(pattern = "abcdXX", 
    x = c("abcdef", "abcXdef", "abcefg"), value = TRUE,
    max.distance = list(sub = 2, ins = 0, del = 0, all = 1 - 1e-9))
# [1] "abcdef"
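
To reuse this pattern, the call above can be wrapped in a small function. This is only a sketch: `agrep_sub_only` and `n_sub` are hypothetical names, not part of base R, and it relies on the same `all = 1 - 1e-9` trick to keep `all` in fractional mode.

```r
# Hypothetical helper (not part of base R): allow up to n_sub substitutions
# and no insertions or deletions.
agrep_sub_only <- function(pattern, x, n_sub) {
  agrep(pattern = pattern, x = x, value = TRUE,
        max.distance = list(sub = n_sub, ins = 0, del = 0,
                            # just under 1 keeps `all` in fractional mode,
                            # so `all` itself permits up to nchar(pattern) edits
                            all = 1 - 1e-9))
}

# Same call as the example above, so the same result is expected
agrep_sub_only("abcdXX", c("abcdef", "abcXdef", "abcefg"), n_sub = 2)
# [1] "abcdef"
```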

The purpose of setting the cost is to allow you to move around the mutation-space at different rates in different directions. How you weight them is going to depend on your use case. For example, some language dialects might be more likely to add letters. You might choose to let a deletion cost as much as two insertions. By default, all three are equally weighted when costs = NULL, i.e. costs = c(ins = 1, del = 1, sub = 1).
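
One way to see what the weights do is `adist()`, which accepts the same `costs` format and reports the weighted distance directly. This is a sketch using a swapped-in function rather than `agrep` itself, and the weights are hypothetical (a deletion costing two insertions).

```r
# Hypothetical weights: a deletion costs as much as two insertions
weights <- c(ins = 1, del = 2, sub = 1)

# Unit costs (the default): two edits separate the strings
d_unit <- adist("abcdef", "abcdXXef")

# Weighted costs: the direction of the comparison now matters,
# because one direction needs insertions and the other deletions
d_forward  <- adist("abcdef", "abcdXXef", costs = weights)
d_backward <- adist("abcdXXef", "abcdef", costs = weights)

c(unit = d_unit[1, 1], forward = d_forward[1, 1], backward = d_backward[1, 1])
```

With unequal weights the two directions should no longer give the same distance, which is the "different rates in different directions" effect described above.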

EDIT: regarding your comment about why some patterns match and others don't, the 10% refers to the number of characters in the pattern, rounding up.

agrep(pattern = "01234567XX89", x = "0123456789", value = TRUE, 
    max.distance = list(sub = 0, ins = 2, del = 0))
# [1] "0123456789"
agrep(pattern = "01234567XX", x = "0123456789", value = TRUE, 
    max.distance = list(sub = 2, ins = 0, del = 0))
# character(0)
num_mutations <- nchar(c("01234567XX89", "01234567XX")) * 0.1
num_mutations
# [1] 1.2 1.0
ceiling(num_mutations)
# [1] 2 1

The second pattern is only 10 characters, so only one substitution is allowed.
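
The rounding rule and the switch to integer mode at 1 can be put into one small function. This is a sketch of the rule as described here, not of agrep's internals: `effective_all` is a hypothetical helper, and it assumes the default unit costs.

```r
# Hypothetical helper (not part of R): the total edit budget implied by `all`,
# assuming unit costs. Below 1 it is a fraction of the pattern length,
# rounded up; from 1 upwards it is taken as an integer count.
effective_all <- function(pattern, all = 0.1) {
  if (all < 1) ceiling(nchar(pattern) * all) else all
}

effective_all("01234567XX89")               # 12 characters
# [1] 2
effective_all("01234567XX")                 # 10 characters
# [1] 1
effective_all("abcdXX", all = 0.2)
# [1] 2
effective_all("abcdXXef", all = 1 - 1e-9)   # just under 1: fractional mode
# [1] 8
effective_all("abcdXXef", all = 1)          # exactly 1: integer mode
# [1] 1
```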

CSJCampbell
  • Thanks for the reply. Does that mean `max.distance` arguments are useless as long as their constraint is more permissive than what `all` argument controls? I read the function help 10 times but I can't understand what `cost` actually controls in `max.distance`. How can I allow only 2 substitutions and nothing else in this approach? – Atakan Jun 03 '21 at 21:44
  • With some experimentation, I noticed `cost` controls how many "mistakes" are allowed. Is that right? For instance, my example works if I set `sub=2` and `cost=2`. But it doesn't work if either is set to `1`. Intuitively it would have made more sense to me if `cost` was the addition of the constraints by default without manual specification. Maybe I'm missing something – Atakan Jun 03 '21 at 21:52
  • Thanks for updating the answer. It makes more sense now for this example. Can you help me understand how I should think about defining rules in the future? It seems like both `ins` and `all` control how many insertions are allowed. I'm thinking of setting `all=1-1e-9` to keep this argument in fraction mode, while determining how many mismatches/insertions I can allow using arguments like `sub` and `ins`. Is this a solid approach? – Atakan Jun 04 '21 at 23:15
  • Sorry to bombard you with comments, but I'm still confused about how `all` affects the other arguments. I find the following perplexing. This works: `agrep("01234567XX89", "0123456789", value = T, max.distance = list(sub=0, ins=2, del=0))` This doesn't: `agrep("01234567XX", "0123456789", value = T, max.distance = list(sub=2, ins=0, del=0))` I can't wrap my mind around why. If `all` is a common upper bound that defaults to `0.1`, both examples should fail to find a match – Atakan Jun 05 '21 at 00:09