3

I'm trying to use stringdist to find, for each string in a vector, another string in the same vector within a maximum distance of 1, and then record that match in a second column. Here is a sample of the data:

Starting data frame:

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 
b = c(NA) 
df = data.frame(a,b) 

Desired results:

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 
b = c("tomm", "tom", "alexi", "alex", 0, "jenn", "jen", 0) 
df = data.frame(a,b) 

I can use stringdist for two vectors, but am having trouble using it for one vector. Thanks for your help, R community.

richiepop2
  • What happens if there is more than one string that matches. (e.g. if there were "tom", "tomm", and "tommy")? Would "tomm" get two strings associated with it? – Jota Jan 10 '17 at 02:50
  • Good question. I wouldn't need two matches, though, just one. I believe by default, stringdist matches the first match, which would be acceptable to me. – richiepop2 Jan 10 '17 at 02:51
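To make the multiple-match case from these comments concrete, here is a minimal sketch using base R's `adist` (not stringdist, so it says nothing about stringdist's own tie-breaking). The names `x`, `candidates`, and `first_match` are illustrative, not from the thread. It lists every candidate within distance 1 and then keeps only the first one:

```r
x <- c("tom", "tomm", "tommy")

d <- adist(x)   # base R (utils) Levenshtein distance matrix
diag(d) <- NA   # ignore each string's distance to itself

# All candidates within distance 1 for each string; which() skips the NAs
candidates <- lapply(seq_along(x), function(i) x[which(d[i, ] <= 1)])
names(candidates) <- x

# Keep only the first candidate, or NA when there is none
first_match <- vapply(
  candidates,
  function(m) if (length(m) > 0) m[1] else NA_character_,
  character(1)
)

candidates[["tomm"]]  # both "tom" and "tommy" are within distance 1
first_match
```

Here "first" means first in vector order, so the result depends on how the input is sorted.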

4 Answers

3

You can use stringdistmatrix and which.min:

library(stringdist)

df = data.frame(a, b, stringsAsFactors = FALSE)
mat <- stringdistmatrix(df$a, df$a)
mat[mat == 0] <- NA  # ignore self-matches (and exact duplicates)
mat[mat > 1] <- NA   # cut-off: keep only distances of at most 1
amatch <- rowSums(mat, na.rm = TRUE) > 0  # rows with at least one match
df$b[amatch] <- df$a[apply(mat[amatch, , drop = FALSE], 1, which.min)]
df
        a     b
1     tom  tomm
2    tomm   tom
3    alex alexi
4   alexi  alex
5   chris  <NA>
6     jen  jenn
7    jenn   jen
8 michell  <NA>
HubertL
3

Here's one possible approach:

library(stringdist)

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 

# For each element, return the closest other element, or "0" if none is within tol
min_dist <- function(x, method = "cosine", tol = .5){
    y <- vector(mode = "character", length = length(x))
    for(i in seq_along(x)){
        dis <- stringdist(x[i], x[-i], method = method)
        if (min(dis) > tol) {
            y[i] <- "0"
        } else {
            y[i] <- x[-i][which.min(dis)]
        }
    }
    y
}

min_dist(a, 'cosine', .4)

## [1] "tomm"  "tom"   "alexi" "alex"  "0"      "jenn"  "jen"   "0"
Tyler Rinker
  • Thanks, @Tyler Rinker! This works perfectly on the sample I provided. When I run it against the larger data set (54k observations) I'm working with, I receive the following error message: `Warning messages: 1: In stringdist(x[i], x[-i], method) : You are passing one or more arguments of type 'list' to 'stringdist'. These arguments will be converted with 'as.character' which is likeley not to give what you want (did you mean to use 'seqdist'?). This warning can be avoided by explicitly converting the argument(s).` Any idea what's going on there? Thanks again. – richiepop2 Jan 10 '17 at 03:36
  • That is a warning not an error. Did it still run? It seems your actual `a` is a list not a vector. Maybe try using `unlist` on it? – Tyler Rinker Jan 10 '17 at 03:57
  • I was able to get this to work. Thank you @Tyler Rinker! – richiepop2 Jan 10 '17 at 05:04
  • @richiepop2 How did you deal with that warning message? Did the `unlist` work like @Tyler Rinker mentioned? – wake_wake Mar 08 '17 at 05:35
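The thread never confirms how the warning was resolved. As a minimal sketch of the `unlist` suggestion, assuming the input is a list in which every element is a single string (`a_list` and `a_vec` are hypothetical names standing in for the asker's real data), the conversion could look like this:

```r
# Hypothetical list-typed input, standing in for the asker's real data
a_list <- list("tom", "tomm", "alex")

# unlist() only keeps positions aligned if every element has length 1
stopifnot(all(lengths(a_list) == 1))

# Flatten to a plain character vector before passing it to stringdist()
a_vec <- as.character(unlist(a_list, use.names = FALSE))

is.character(a_vec)
```

If some elements hold more than one string, or none, `unlist` changes the length of the result, so the length check is worth keeping.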
0

We can use adist from base R too:

library(reshape2)
out <- as.data.frame(adist(df$a)) #as.matrix(stringdistmatrix(df[,1])))
out$names <- names(out) <- df$a
out <- subset(melt(out, id='names'), value==1)[1:2]
names(out) <- names(df)
out <- rbind(out, data.frame(a=setdiff(unique(df[,1]), out$a), b='0'))
out
#   a     b
#2     tomm   tom
#9      tom  tomm
#20   alexi  alex
#27    alex alexi
#47    jenn   jen
#54     jen  jenn
#7    chris     0
#8  michell     0
Sandipan Dey
0

Here's a short solution:

df$b <- sapply(seq_along(df$a), function(i){ 
  lookup <- df$a[-i]  # every string except the current one
  j <- stringdist::amatch(df$a[i], lookup, maxDist = 1)
  if (is.na(j)) NA_character_ else lookup[j]
})