
I've run into a small bug/feature in R: the agrep function does not accept the "|" character as regular-expression alternation in its pattern argument (others have had this problem too).

I'm trying to fuzzy-match 30 distinct names in a character vector (ListofUniqueNames) against more than 380,000 names in a data-frame column (MasterList$Names) and return all the matching names. I was able to do this for exact matches using grep:

grep(paste(ListofUniqueNames,collapse="|"),MasterList$Names, value=TRUE, ignore.case = TRUE)

However, this approach doesn't work for agrep due to the problem listed above. How can I accomplish this same task but with fuzzy matching?

Perf Gigi
  • The `grep` statement above actually works. Did you mean to use `agrep` instead of `grep`? – G5W Oct 06 '17 at 23:13
  • The statement is an example of a grep function that worked for exact matches. However `agrep(paste(ListofUniqueNames,collapse="|"),MasterList$Names, value=TRUE, ignore.case = TRUE, fixed=FALSE)` does not work. Neither does `agrep('(asdf|fdsa)', 'asdf', fixed=F)` – Perf Gigi Oct 06 '17 at 23:22
  • interesting stuff: `agrep("asdf", "(asdf|fdsa)")` and then `agrep("(asdf|fdsa)", "asdf", max.distance=.55)`. – lmo Oct 06 '17 at 23:24

1 Answer


You could call agrep one by one for each pattern, and then combine the results:

unlist(lapply(ListofUniqueNames, function(x) agrep(x, MasterList$Names, value = TRUE, ignore.case = TRUE)))
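
As a rough illustration of the per-pattern approach, here is a minimal sketch on toy data. The sample names below are hypothetical stand-ins for ListofUniqueNames and MasterList$Names, and max.distance = 0.1 is simply agrep's documented default written out, not a recommended setting. Because a name in the master list could match more than one pattern, the sketch also de-duplicates the combined result, which is an assumption about what output is wanted.

```r
# Hypothetical stand-ins for the vectors in the question
ListofUniqueNames <- c("Jonathan Smith", "Maria Gonzalez")
MasterList <- data.frame(
  Names = c("Jonathon Smith", "maria gonzales", "Peter Parker", "Jonathan Smith"),
  stringsAsFactors = FALSE
)

# One agrep() call per pattern; naming the input keeps track of
# which pattern produced which matches
matches_by_pattern <- lapply(
  setNames(ListofUniqueNames, ListofUniqueNames),
  function(p) agrep(p, MasterList$Names, max.distance = 0.1,
                    value = TRUE, ignore.case = TRUE)
)

# Flatten to a single character vector and drop names that were
# matched by more than one pattern
all_matches <- unique(unlist(matches_by_pattern, use.names = FALSE))
print(all_matches)
```

Keeping the intermediate named list makes it easy to see which pattern each match came from; with 30 patterns against 380,000 names this is still only 30 agrep calls, though the run time will depend on the data.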
janos