
I have a data.table dt with 3 columns:

  • nid: an id
  • rname: a name (string)
  • maxr: a threshold (numeric)

A sample is:

dt <- data.table(nid = c("n1", "n2", "n3", "n4"),
                 rname = c("apple", "pear", "banana", "kiwi"),
                 maxr = c(0.5, 0.8, 0.7, 0.6))

nid | rname  | maxr
n1  | apple  |  0.5
n2  | pear   |  0.8
n3  | banana |  0.7
n4  | kiwi   |  0.6

I have a second table dt.ref with 2 columns:

  • cid: an id
  • cname: a name (string)

A sample is:

dt.ref <- data.table(cid = c("c1", "c2", "c3", "c4", "c5", "c6"),
                     cname = c("apple", "maple", "peer", "dear", "bonobo", "kiwis"))

cid | cname
c1  | apple
c2  | maple
c3  | peer
c4  | dear
c5  | bonobo
c6  | kiwis

For each rname of dt, I would like to compute the Levenshtein ratio Lr with each cname of dt.ref as follows:

Lr = 1 - (stringdist(cname, rname, method = "lv") / pmax(nchar(cname),nchar(rname)))
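
For example, "pear" and "peer" differ by one substitution and both have 4 characters, so the ratio should come out as 0.75:

library(stringdist)
1 - stringdist("peer", "pear", method = "lv") / pmax(nchar("peer"), nchar("pear"))
# [1] 0.75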

Then, I would like to find max(Lr) over the cnames for each rname of dt and get the following data.table as output:

nid | rname  | maxr | maxLr | cid
n1  | apple  |  0.5 | 1     | c1
n2  | pear   |  0.8 | 0.75  | c3
n2  | pear   |  0.8 | 0.75  | c4
n3  | banana |  0.7 | 0.33  | c5
n4  | kiwi   |  0.6 | 0.8   | c6

Basically, we take dt and add two columns: the maximum Levenshtein ratio and the corresponding cid. All ties are kept, one per row, as for n2.

I use data.table but the solution can use dplyr or any other package.

user2590177

1 Answer


You can try something like this:

f1 <- function(x, y) {
  require(stringdist)
  require(matrixStats)
  # pairwise Levenshtein distances: one row per element of x, one column per element of y
  dis <- stringdistmatrix(x, y, method = "lv")
  # matching matrix of pmax(nchar(x), nchar(y)) for each pair
  mat <- sapply(nchar(y), function(i) pmax(i, nchar(x)))
  # Levenshtein ratio for every pair
  r <- 1 - dis / mat
  # for each row, the column index(es) of the maximum (all ties kept)
  w <- apply(r, 1, function(row) which(row == max(row)))
  # the row maxima themselves
  m <- rowMaxs(r)
  list(m = m, w = w)
}

r <- f1(dt[[2]], dt.ref[[2]])
r
$m
[1] 1.0000000 0.7500000 0.3333333 0.8000000

$w
$w[[1]]
[1] 1

$w[[2]]
[1] 3 4

$w[[3]]
[1] 5

$w[[4]]
[1] 6


dt[, maxLr := r$m]
# expand rows so that each tie gets its own row
# dtnew <- dt[rep(1:.N, sapply(r$w, length)), ]  # equivalent, but lengths() is faster
dtnew <- dt[rep(1:.N, lengths(r$w)), ]  # thanks to Frank
# attach the cid of each (possibly tied) best match
dtnew[, cid := dt.ref$cid[unlist(r$w)]]

Results:

dtnew
   nid  rname maxr     maxLr cid
1:  n1  apple  0.5 1.0000000  c1
2:  n2   pear  0.8 0.7500000  c3
3:  n2   pear  0.8 0.7500000  c4
4:  n3 banana  0.7 0.3333333  c5
5:  n4   kiwi  0.6 0.8000000  c6
minem
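
For reference (not from the original answer), the same result can be obtained with plain data.table grouping instead of an explicit matrix. Note that it still scores every rname/cname pair, so the memory concerns discussed in the comments below apply just as much:

library(data.table)
library(stringdist)

# score every cname against each rname, grouped per row of dt
scored <- dt[, .(cid = dt.ref$cid,
                 Lr  = 1 - stringdist(dt.ref$cname, rname, method = "lv") /
                           pmax(nchar(dt.ref$cname), nchar(rname))),
             by = .(nid, rname, maxr)]
# keep the row(s) attaining the per-rname maximum (ties are all retained)
res <- scored[, .SD[Lr == max(Lr)], by = .(nid, rname, maxr)]
setnames(res, "Lr", "maxLr")
setcolorder(res, c("nid", "rname", "maxr", "maxLr", "cid"))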
  • Thanks minem. Very interesting approach. I will test it at full scale. – user2590177 Aug 08 '17 at 14:01
  • I am trying it with 75k rows in dt.ref and 30k in dt. It takes very long. Would it be possible to speed up the process using only data.table and not a matrix? – user2590177 Aug 09 '17 at 09:55
  • It makes the session hang in RStudio and abort. – user2590177 Aug 09 '17 at 10:12
  • @user2590177 `stringdistmatrix` calculates the distance for each combination, which results in a 75k*30k matrix; that is probably very time consuming, and you probably cannot store such a big matrix in memory. How much RAM do you have? For data this big, you probably need to find a different approach, maybe split the process into parts. I would suggest that you ask a new question about your problem. Describe the data you are working with and add this solution. – minem Aug 09 '17 at 10:29
  • With 32GB of RAM I thought it would work. I reduced the 30k to 9k. I now have to find how to split or pre-sort the ref so that I can reduce the 75k as well. Thanks. – user2590177 Aug 09 '17 at 10:47
  • @user2590177 then probably the `apply`/`sapply` functions are quite slow; maybe there is some way to speed them up. Anyway, I suggest that you create a new question, and maybe someone has an idea how to speed things up. – minem Aug 09 '17 at 10:52
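
A minimal sketch of the splitting idea from the last comments, reusing f1 from the answer above. chunked_match and chunk_size are hypothetical names; the point is that only one chunk_size x nrow(dt.ref) distance matrix is held in memory at a time:

chunked_match <- function(dt, dt.ref, chunk_size = 1000) {
  # split the row indices of dt into blocks of at most chunk_size
  idx <- split(seq_len(nrow(dt)), ceiling(seq_len(nrow(dt)) / chunk_size))
  out <- lapply(idx, function(i) {
    r <- f1(dt$rname[i], dt.ref$cname)               # small matrix for this chunk only
    chunk <- dt[i][rep(seq_along(i), lengths(r$w))]  # expand rows for ties
    chunk[, maxLr := rep(r$m, lengths(r$w))]
    chunk[, cid := dt.ref$cid[unlist(r$w)]]
    chunk
  })
  rbindlist(out)
}

dtnew <- chunked_match(dt, dt.ref, chunk_size = 1000)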