
I have a large vector of strings like this:

d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )

I want to fetch similar strings for each string from the same vector d.

I am doing this by
1. calculating, for each string, the edit distance to all other strings, with rules such as forcing exact matching if any digits are present or if the number of alphabetic characters is less than 5.
2. putting the distances in a data frame `Dist` along with the strings.
3. subsetting `Dist` to distances < 3.
4. collapsing the similar strings and adding them to the original data frame as a new column.
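As a quick illustration of step 1 (my own example, not part of the original code): the edit distance used here is the Levenshtein distance, which base R's `adist()` also computes, the same as stringdist's `method="lv"`:

```r
# Levenshtein distance from one string to a few others, via base R's adist()
adist("basket", c("basket", "baskket", "casket93", "trump"))
# 0 (identical), 1 (one extra "k"), 3, 6
```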

I am using the stringr and stringdist packages:

d <- as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2)) 
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)

require(stringr)
require(stringdist)

for (i in 1:M){
  # if string has digits or is of short size (<5) do exact matching
  if (grepl("[[:digit:]]", d[i, "d"]) || str_count(d[i, "d"], "[[:alpha:]]") < 5){
    Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=0.000001) # maxDist as fraction to force exact matching
  # otherwise do approximate matching
  } else  {
    Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=3)
  }
  # subset similar strings (with edit distance <3)
  subDist <- subset(Dist, dist < 3 )
  # add to original data.frame d
  d[i, "sim"] <- paste(as.character(unlist(subDist$string)), collapse=", ")
}

Is it possible to vectorise the procedure instead of using a loop? I have a very large vector of strings, so calculating a distance matrix with stringdistmatrix on the entire vector is not possible due to memory restrictions. The loop works fine on large data, but is very slow.

Henrik
Crops

2 Answers


stringdist has a matrix version for computing all the distances at once, so I think something like this will be an improvement; it's about four times as quick on my computer when run with the 100-reps line included:

d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
#d <- rep(d, each=100) #make it a bit longer for timing

d <- as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)

require(stringr)
require(stringdist)

ind_short <- grepl("[[:digit:]]", d$d) | str_count(d$d, "[[:alpha:]]") < 5

short <- stringdistmatrix(d$d[ind_short], d$d, method="lv", maxDist=0.000001)
long <- stringdistmatrix(d$d[!ind_short], d$d, method="lv", maxDist=3)

d$sim[ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
d$sim[!ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))

The basic strategy is to split the strings into short and long components, use the matrix form of stringdist on each, collapse the matches with paste, and assign them to the right places in d$sim.
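A base-R sketch of that split-and-matrix strategy, using `adist()` in place of `stringdistmatrix()` so it runs without any packages (and a separate vector `d0` so as not to clobber the data frame `d` above):

```r
d0 <- c("herb", "herbb", "basket", "baskket", "sim23", "old")

# split: strings with digits or fewer than 5 letters need exact matching
ind_short <- grepl("[[:digit:]]", d0) | nchar(gsub("[^[:alpha:]]", "", d0)) < 5

# full Levenshtein distance matrix for the "long" strings against everything
long_mat <- adist(d0[!ind_short], d0)

# collapse each row into the comma-separated similar strings (distance < 3)
sim_long <- apply(long_mat, 1, function(x) paste(d0[x < 3], collapse = ", "))
# "herb, herbb"  "basket, baskket"  "basket, baskket"
```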


Edited to add: in light of your comment about not being able to work on the whole matrix at once, try choosing chunk_length so that stringdistmatrix() works on a chunk_length * M matrix. Of course, if you set it to 1, you're back to the original unvectorised form.

chunk_length <- 100
ind_short <- grepl("[[:digit:]]", d$d) | str_count(d$d, "[[:alpha:]]") < 5
d$iter <- rep(1:M,each=chunk_length,length.out=M)

for (i in unique(d$iter))
{
  in_iter <- (d$iter == i)
  short <- stringdistmatrix(d$d[in_iter & ind_short], d$d, method="lv", maxDist=0.000001)
  long <- stringdistmatrix(d$d[in_iter & !ind_short], d$d, method="lv", maxDist=3)

  if(sum(in_iter & ind_short)==1) short <- t(short)
  if(sum(in_iter & !ind_short)==1) long <- t(long)

  if(sum(in_iter & ind_short)>0) d$sim[in_iter & ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
  if(sum(in_iter & !ind_short)>0) d$sim[in_iter & !ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
}
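To see what the chunk assignment does, `rep()` with `each` and `length.out` simply labels consecutive runs of rows with the same chunk id (a small illustration, assuming M = 7 and chunk_length = 3):

```r
# chunk ids for 7 rows in chunks of 3: rows 1-3 -> 1, rows 4-6 -> 2, row 7 -> 3
rep(1:7, each = 3, length.out = 7)
# 1 1 1 2 2 2 3
```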
Miff
  • I had tried `stringdistmatrix`. It works fine for a small number of strings, but my full dataset has over 50,000 strings, so I was unable to generate the matrix itself. – Crops May 01 '14 at 13:15
  • Generally you'll struggle to do much vectorisation where you can't hold the whole matrix in memory, so you'll need to experiment with how big a matrix you can use `stringdistmatrix()` on. Will try and edit the code to take account of this. – Miff May 01 '14 at 13:33
  • @Crops, I added the issue using `stringdistmatrix` to your question. – Henrik May 01 '14 at 14:44
  • @Miff Wow, the chunk_length solution was clever. Worked for me. Way faster than my crude loop. – Crops May 02 '14 at 04:40
  • I have tried with different chunk_length values and different string sets. I am sometimes receiving the error `4: In d$sim[in_iter & !ind_short] <- apply(long, 1, function(x) paste(as.character(unlist(d$d[x < : number of items to replace is not a multiple of replacement length` – Crops May 02 '14 at 09:03
  • stringdistmatrix seems to behave inconsistently when one of the vectors has length 1, so I've added a check to the code to catch this. – Miff May 02 '14 at 09:32
  • Also, as of stringdist (0.7.0) there is a `useBytes` option. For ASCII or other single-byte encoded strings of up to ~100 characters, `useBytes=TRUE` improves performance. [A bit of benchmarking with string distances](http://www.markvanderloo.eu/yaRb/2013/09/07/a-bit-of-benchmarking-with-string-distances/) – Crops Jul 01 '14 at 09:35

It's not really an answer, but I thought it might be good to mention that agrep may be useful for this project; it does partial (fuzzy) pattern matching.

> d <- c("herb", "market", "merchandise", "fun", "casket93", 
         "old", "herbb", "basket", "bottle", "plastic", "baskket",
         "markket", "pasword", "plastik", "oldg", "mahagony", 
         "mahaagoni", "sim23", "asket", "trump" )
> agr <- sapply(d, function(x) agrep(x, d, value = TRUE))
> head(agr)
$herb
[1] "herb"  "herbb"

$market
[1] "market"  "markket"

$merchandise
[1] "merchandise"

$fun
[1] "fun"

$casket93
[1] "casket93"

$old
[1] "old"     "pasword" "oldg"   
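One detail worth noting about the output above: the `$old` entry picks up `"pasword"` because `agrep`'s default `max.distance = 0.1` is fuzzy enough to allow an edit for this short pattern. The `max.distance` argument lets you tighten that (a small follow-up sketch, not part of the original answer):

```r
d <- c("herb", "market", "merchandise", "fun", "casket93",
       "old", "herbb", "basket", "bottle", "plastic", "baskket",
       "markket", "pasword", "plastik", "oldg", "mahagony",
       "mahaagoni", "sim23", "asket", "trump")

# exact substring matching only: no edits allowed
agrep("old", d, max.distance = 0, value = TRUE)
# "old" "oldg"   ("pasword" no longer matches: it has no literal "old")
```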
Rich Scriven