I have a vector of locations that I am trying to disambiguate against a vector of correct location names. For this example I am using just two disambiguated locations:

agrepl('Au', c("Austin, TX", "Houston, TX"), 
max.distance =  .000000001, 
ignore.case = T, fixed = T)
[1] TRUE TRUE

The help page says that max.distance is

Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost

I am not sure about the mathematical meaning of the Levenshtein distance; my understanding is that the smaller the distance, the stricter the tolerance for mismatches against my vector of disambiguated strings.

So how would I adjust it to get two FALSE values? Basically, I would like to get TRUE only when the strings differ by one character, as in:

agrepl('Austn, TX', "Austin, TX", 
max.distance =  .000000001, ignore.case = T, fixed = T)
[1] TRUE
Dambo
  • Try `adist` instead. The issue is that you have partial matches occurring, so `Au` matches `*Au*stin` straight away. For example, `adist(c("Au","Austn, TX"), c("Austin, TX", "Houston, TX"), partial=FALSE)` – thelatemail Jun 01 '16 at 04:09
  • If you pass `max.distance` an integer, it uses it as the number of changes allowed instead of the proportion. You can also pass it a named list of limits for particular types of changes, e.g. `agrepl('Au', c('Austin, TX', 'Houston, TX'), max.distance = c(costs = 1, insertions = 0, deletions = 1, substitutions = 0), ignore.case = T, fixed = T)`. See `?agrep` for more. – alistaire Jun 01 '16 at 04:21
  • @thelatemail Thanks. Should I write a function to grab the string with the smallest difference, or is there a specific way to retrieve the values rather than the distances, based on a custom threshold? @alistaire That's what I thought, but if you check you'll see that "Au" matches "Austin, TX", which I don't want. – Dambo Jun 01 '16 at 14:43

1 Answer

The problem you are having is possibly similar to the one I faced when I first started experimenting with this function. The first argument is a regex pattern when fixed=FALSE, and in either mode short patterns are very permissive unless they are constrained to match the full string. The help page even has a "Note" about that issue:

Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements.
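One way to see the substring behaviour that the note describes is to compare adist with partial = TRUE and partial = FALSE on the strings from the question. This is only a sketch: `targets`, `d_partial` and `d_whole` are names introduced here for illustration.

```r
# Reference names taken from the question.
targets <- c("Austin, TX", "Houston, TX")

# partial = TRUE: cheapest approximate match of the pattern against
# any substring of each target (the same idea agrepl works with).
d_partial <- adist("Au", targets, partial = TRUE)

# partial = FALSE: cost of turning the whole pattern into the whole target.
d_whole <- adist("Au", targets, partial = FALSE)

print(d_partial)  # "Au" occurs in "Austin, TX" (0 edits) and is 1 edit from "ou"
print(d_whole)    # much larger, because every missing character counts
```

The partial distances are 0 and 1, which is consistent with the two TRUE values in the question if, as I read ?agrep, a fractional max.distance is rounded up to the next integer, so that even a tiny fraction still allows one edit.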

With regex patterns you can force a whole-string match by flanking the pattern with "^" and "$", since, unlike adist, agrepl has no partial parameter:

> agrepl('^Au$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE
> agrepl('^Austn, TX$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl('^Austn, T$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE

So you need to paste0 with those flankers:

> agrepl( paste0('^', 'Austn, Tx', '$'), "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl( paste0('^', 'Au', '$'), "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE

It might be better to set the all component of max.distance rather than just insertions, and you may want to lower the fraction.
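If what you want in the end is the matching name rather than TRUE/FALSE, an alternative along the lines of the adist suggestion in the comments is to compute whole-string distances and keep the nearest name only when it is close enough. The following is a minimal sketch, not a tested solution: `closest_name` and `max_edits` are hypothetical names, and it assumes the queries contain no NA values.

```r
# Hypothetical helper: for each query, return the closest reference name,
# or NA when even the closest one is more than `max_edits` edits away.
closest_name <- function(queries, targets, max_edits = 1) {
  # Whole-string edit distances: one row per query, one column per target.
  d <- adist(queries, targets, ignore.case = TRUE, partial = FALSE)
  # Column index of the nearest target for each query.
  best <- apply(d, 1, which.min)
  # Distance to that nearest target.
  best_dist <- d[cbind(seq_along(queries), best)]
  ifelse(best_dist <= max_edits, targets[best], NA_character_)
}

res <- closest_name(c("Au", "Austn, TX", "Houston, TX"),
                    c("Austin, TX", "Houston, TX"))
print(res)
```

With the default threshold of one edit, "Au" gets NA, "Austn, TX" maps to "Austin, TX", and an exact name maps to itself. Raising `max_edits` loosens the match.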

IRTFM