The pattern list looks like:

pattern <- c('aaa','bbb','ccc','ddd')

Column X of df looks like:

df$X <- c('aaa-053','aaa-001','aab','bbb')

What I tried to do: use agrep to find the name in pattern that matches each value of df$X, then write that name into an existing column, 'column2'. For example, if 'aaa-053' matches 'aaa', then 'aaa' should be the value in 'column2'; if nothing matches, the value should be NA.

for (i in 1:length(pattern)) {
 match <- agrep(pattern, df$X, ignore.case=TRUE, max=0)
 if agrep = TRUE {
   df$column2 <- pattern
 } else {df$column2 <- na
 }
}

Ideal column2 in df looks like:

'aaa','aaa',na,'bbb'
Cyrus

1 Answer


On its own, agrep doesn't tell you which pattern to choose when more than one matches. For instance,

agrep(pattern[1], df$X)
# [1] 1 2 3

That makes sense for the first two, but the third ('aab') is not among your expected matches. Similarly, a single string could match more than one pattern.
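One way to see that ambiguity is to count how many patterns agrepl reports for each string. This is only a rough sketch using the sample data from the question; the name `hits` is made up for illustration, and the default `max.distance` is left unchanged.

```r
pattern <- c('aaa', 'bbb', 'ccc', 'ddd')
X <- c('aaa-053', 'aaa-001', 'aab', 'bbb')

# One column per pattern, one row per string in X;
# TRUE where agrepl() reports an approximate match.
hits <- sapply(pattern, function(p) agrepl(p, X, ignore.case = TRUE))

# Number of patterns matching each string.
# 0 means unmatched, more than 1 means the choice is ambiguous.
rowSums(hits)
```

Any string whose count is not exactly 1 needs an extra rule to decide what goes into column2.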

Here's an alternative:

D <- adist(pattern, df$X, fixed = FALSE)
D
#      [,1] [,2] [,3] [,4]
# [1,]    0    0    1    3
# [2,]    3    3    2    0
# [3,]    3    3    3    3
# [4,]    3    3    3    3
D[D > 0] <- NA
D
#      [,1] [,2] [,3] [,4]
# [1,]    0    0   NA   NA
# [2,]   NA   NA   NA    0
# [3,]   NA   NA   NA   NA
# [4,]   NA   NA   NA   NA
apply(D, 2, function(z) which.min(z)[1])
# [1]  1  1 NA  2
pattern[apply(D, 2, function(z) which.min(z)[1])]
# [1] "aaa" "aaa" NA    "bbb"
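To write the result back into the data frame, the steps above could be wrapped up roughly as follows. This is a sketch, not a tested general solution: `best_pattern` and its `max_dist` argument are hypothetical names introduced here, and `ignore.case = TRUE` is added only to mirror the question's agrep call.

```r
pattern <- c('aaa', 'bbb', 'ccc', 'ddd')
df <- data.frame(X = c('aaa-053', 'aaa-001', 'aab', 'bbb'),
                 stringsAsFactors = FALSE)

# Hypothetical helper: for each string in x, return the closest pattern,
# or NA when no pattern is within max_dist edits.
best_pattern <- function(x, patterns, max_dist = 0) {
  # Rows are patterns, columns are the strings in x
  D <- adist(patterns, x, fixed = FALSE, ignore.case = TRUE)
  # Discard matches that are too far away
  D[D > max_dist] <- NA
  # which.min() returns integer(0) for an all-NA column; [1] turns that into NA
  idx <- apply(D, 2, function(z) which.min(z)[1])
  patterns[idx]
}

df$column2 <- best_pattern(df$X, pattern)
df$column2
# [1] "aaa" "aaa" NA    "bbb"
```

Note that which.min keeps only the first minimum, so if two patterns tie at the same distance, the one listed earlier in `pattern` wins.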
r2evans
  • I think this works for numerical data, but my condition is based on character values, that is why I initially tried to use agrep – onemikeone Mar 28 '21 at 22:11
  • I have *no idea* what you mean by that. `agrep` works on strings, not numbers, as does `adist`. The point of this answer is to (1) demonstrate that your assumption of single matches is flawed; and (2) suggest a methodology to try to mitigate that shortcoming. The "numeric" portion of this answer is to find the smallest "distance" between strings, which should indicate the best match. If you want to avoid numbers, then I suggest you either use data that never has the risk of overlap as your sample does, or devise methods that are much more intuitive than fuzzy string matching. Good luck! – r2evans Mar 29 '21 at 12:18
  • Another perspective: given your sample data, this gives the answer you expected, as strings. Is there another condition or property of the data that makes this code function poorly? Are there other data where this does not work? If it fails with your real data, then you cannot expect anything better if you don't improve your question to include more representative sample data. – r2evans Mar 29 '21 at 12:21