
I have a data frame containing user data:

library(stringdist)  # provides stringdist() and stringdistmatrix()

x <- data.frame("Address_line1" = c("461 road","PO Box 123","543 Highway"), 
                "City" = c("Dallas","Paris","New York" ), "Phone" = c("235","542","842"))
x
  Address_line1     City Phone
1      461 road   Dallas   235
2    PO Box 123    Paris   542
3   543 Highway New York   842

I also have a named character vector with the same fields, in the same order as the columns of the data frame:

y = c("443 road","New york","842")
names(y) = colnames(x)

y

Address_line1          City         Phone 
   "443 road"    "New york"         "842"

I want to iterate through each row of this dataframe, compute the stringdist() of corresponding fields of x with y, sum these values and get a total score for each row.

For example, the score for first row will be:

row_1 = stringdist("461 road","443 road",method="lv") + stringdist("Dallas","New york",method="lv") + stringdist("235","842",method="lv")

row_1
[1] 13

Similarly, I want a score for all rows of the dataframe. This is the code I have written using for loops:

list_dist = rep(NA,0)

for(i in seq_len(nrow(x))){
    list_x = x[i,]
    sum=0
    for(j in seq_len(length(y))){
        sum = sum + stringdist(y[j],list_x[[j]],method = "lv")
    }
    #print(sum)
    list_dist[i] = sum
}


list_dist
[1] 13 18  8

I'm able to get the desired output, but the issue is the computation time. Since my original table contains ~100k rows and 10 columns, it takes close to 30 mins for the code to run. I was wondering if there is a more efficient method to do this.

Rahul
  • just by first glance, if you know the length of your final output, you can pre-allocate your list object `list_dist` to speed up the for loop. otherwise you could look into parallel processing – EJJ Jun 26 '20 at 16:46
  • Is `y` always a vector or is it a data.frame? – Rui Barradas Jun 26 '20 at 16:47
  • @RuiBarradas `y` can be a data.frame as well, in which case, the score would ideally be a matrix of size nrow(y) x nrow(x) – Rahul Jun 26 '20 at 16:52
  • OK, can you then update the `y` example? – Rui Barradas Jun 26 '20 at 17:03

1 Answer


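A vectorized version is much faster. stringdist() is vectorized over its arguments, so mapply() can pair each element of y with the matching column of x, and rowSums() then adds the per-column distances for each row.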

rowSums(mapply(stringdist, y, x, method = 'lv'))
#[1] 13 18  8
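To make it clearer what the one-liner does, here is a minimal sketch of the same computation written as an explicit loop over columns instead of rows. It assumes the example data from the question; the name `scores` is only illustrative.

```r
library(stringdist)

# Example data from the question
x <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                City = c("Dallas", "Paris", "New York"),
                Phone = c("235", "542", "842"),
                stringsAsFactors = FALSE)
y <- c(Address_line1 = "443 road", City = "New york", Phone = "842")

# One pre-allocated accumulator per row of x
scores <- numeric(nrow(x))

# Loop over the columns (3 here) rather than over the rows:
# each stringdist() call compares one field of y with a whole column of x
for (j in seq_along(y)) {
  scores <- scores + stringdist(y[[j]], x[[j]], method = "lv")
}

scores
#[1] 13 18  8
```

The number of stringdist() calls then depends on the number of columns, not the number of rows, which is where most of the speedup comes from.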

Edit

Here are timings with small x. The functions were timed with package microbenchmark.

Rahul <- function(){
  list_dist = rep(NA,0)

  for(i in seq_len(nrow(x))){
      list_x = x[i,]
      sum=0
      for(j in seq_len(length(y))){
          sum = sum + stringdist(y[j],list_x[[j]],method = "lv")
      }
      #print(sum)
      list_dist[i] = sum
  }
  list_dist
}
Rui <- function(){
  rowSums(mapply(stringdist, y, x, method = 'lv'))
}

library(microbenchmark)

for(i in 1:6) x <- rbind(x,x)
dim(x)
[1] 192  3

mb <- microbenchmark(
  Rui = Rui(),
  Rahul = Rahul()
)

print(mb, unit = 'relative', order = 'median')
#Unit: relative
#  expr      min       lq     mean   median       uq      max neval
#   Rui   1.0000   1.0000   1.0000   1.0000   1.0000   1.0000   100
# Rahul 141.5944 137.4175 133.4313 134.4163 132.2977 119.6172   100

The difference is already two orders of magnitude, and it will grow as nrow(x) grows.
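To try the vectorized version at roughly the size mentioned in the question, something like the sketch below could be used. It only repeats the three example rows, so the strings are not realistic, and `x_small`, `x_big` and `scores_big` are illustrative names; timings will depend on the machine and on the real data.

```r
library(stringdist)

# Example data from the question
x_small <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                      City = c("Dallas", "Paris", "New York"),
                      Phone = c("235", "542", "842"),
                      stringsAsFactors = FALSE)
y <- c(Address_line1 = "443 road", City = "New york", Phone = "842")

# Repeat the example rows until there are 100,000 of them
x_big <- x_small[rep(seq_len(nrow(x_small)), length.out = 1e5), ]

# Time the vectorized computation on the larger data frame
timing <- system.time(
  scores_big <- rowSums(mapply(stringdist, y, x_big, method = "lv"))
)

timing
length(scores_big)
```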

Edit 2

Following a comment on the question, the function below returns an nrow(y) x nrow(x) matrix of scores, whether y is a vector (treated as a single row) or a data.frame.

This function is not the function Rui in the speed test above.

rui <- function(x, y){
  out <- mapply(stringdistmatrix, y, x, MoreArgs = list(method = 'lv'), SIMPLIFY = FALSE)
  Reduce('+', out)
}

z <- data.frame(Address_line1 = c("443 road", "461 road"),
                City = c("New york", "London"), Phone = c("842", "524"))

rui(x, y)
#     [,1] [,2] [,3]
#[1,]   13   18    8

rui(x, z)
#     [,1] [,2] [,3]
#[1,]   13   18    8
#[2,]    9   17   19
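Entry [i, j] of the result is the total distance between row i of the second argument and row j of x. As a possible usage sketch, assuming the goal is to find the closest row of x for each row of z (the name `best` is only illustrative), and using the original 3-row x:

```r
library(stringdist)

# Example data from the question and from Edit 2
x <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                City = c("Dallas", "Paris", "New York"),
                Phone = c("235", "542", "842"),
                stringsAsFactors = FALSE)
z <- data.frame(Address_line1 = c("443 road", "461 road"),
                City = c("New york", "London"),
                Phone = c("842", "524"),
                stringsAsFactors = FALSE)

# Same function as above: one distance matrix per column, then summed
rui <- function(x, y){
  out <- mapply(stringdistmatrix, y, x, MoreArgs = list(method = 'lv'), SIMPLIFY = FALSE)
  Reduce('+', out)
}

scores <- rui(x, z)

# For each row of z, the index of the row of x with the smallest total distance
best <- apply(scores, 1, which.min)
best
#[1] 3 1
```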
Rui Barradas