Aim:

I'm trying to compare each element in a list with all the elements below it, using the Levenshtein similarity (the stringsim function from the stringdist package), to find text that is similar.

Obstacle:

The problem is that the number of comparisons grows quadratically with the number of elements (n(n-1)/2 pairs for n elements), so it takes a long time to run. For a 5-element array this comes to 10 comparisons/iterations (4+3+2+1).

The calculator and theory can be found here link
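
As a quick check of the arithmetic above: the number of unordered pairs among n elements is n(n-1)/2, which base R's `choose(n, 2)` computes directly. This is only a sketch; the sizes are the 5-element example, the 80 elements of stringr::fruit, and an assumed 7,000 rows standing in for the ~6k to ~7k target.

```r
# Number of pairwise comparisons for a few input sizes
n <- c(5, 80, 7000)
pairs <- choose(n, 2)   # same as n * (n - 1) / 2
print(data.frame(n = n, comparisons = pairs))
```

At around 7,000 strings this is roughly 24.5 million comparisons, so the per-comparison overhead matters far more than it does in the 5-element example.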

Attempt:

I'll reproduce using a normal for loop.

library(stringr)    # provides the `fruit` character vector
library(stringdist) # provides stringsim()
library(dplyr)      # provides tibble() and bind_rows()

fruits <- fruit[1:5] # first 5 elements of stringr::fruit
n <- length(fruits)  # n is 5
score_df <- tibble(x=character(0),y=character(0),score=numeric(0)) # empty tibble to hold each compared pair and its score

cnt <- 0 # counts how many comparisons ran
for(i in 1:(n-1)){
  print(i)
  print('----')
  # `:` binds tighter than `+`, so i+1:(n-i) is i + (1:(n-i)), i.e. (i+1):n
  for(j in i+1:(n-i)){
  cnt <- cnt+1
  print(j)
  
  initial_term <- fruits[i]  # first element
  compared_term <- fruits[j] # an element below it
  score <- stringsim(initial_term,compared_term, method = 'lv') # Levenshtein similarity, between 0 and 1
  term <- tibble(x=initial_term, y=compared_term, score=score) # one-row tibble for this pair
  score_df <- bind_rows(score_df, term) # append the row to the results
  
  }
  print('====')
}
print(paste('operations count: ', cnt)) # print the number of comparisons

You can see that the result is correct, with all 10 pairs compared:

> as_tibble(fruits)
# A tibble: 5 x 1
  value      
  <chr>      
1 apple      
2 apricot    
3 avocado    
4 banana     
5 bell pepper

> score_df
# A tibble: 10 x 3
   x       y            score
   <chr>   <chr>        <dbl>
 1 apple   apricot     0.286 
 2 apple   avocado     0.143 
 3 apple   banana      0.167 
 4 apple   bell pepper 0.273 
 5 apricot avocado     0.143 
 6 apricot banana      0     
 7 apricot bell pepper 0.0909
 8 avocado banana      0.143 
 9 avocado bell pepper 0     
10 banana  bell pepper 0.0909

Request:

I was finally able to convert that ordinary loop to a parallelized one. Below is a sample running on the stringr::fruit dataset.

I need help optimizing the loop below, if that is possible, so that I can run it on ~6k to ~7k rows; my attempt with the code below made RStudio crash.

My processor:

PS> Get-WmiObject -Class Win32_Processor -ComputerName . | Select-Object -Property Name,NumberOfCores,NumberOfEnabledCore,NumberOfLogicalProcessors,Description


Name                      : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
NumberOfCores             : 6
NumberOfEnabledCore       : 6
NumberOfLogicalProcessors : 12
Description               : Intel64 Family 6 Model 158 Stepping 10

library(stringr)    # provides the `fruit` character vector
library(foreach)
library(parallel)
library(doParallel)

fruits <- fruit
n <- length(fruits)

numCores <- detectCores() # 12
registerDoParallel(numCores - 1) # use 11 of the 12 logical processors

score_df <- foreach(i = 1:(n-1), .combine = 'rbind') %:%

 foreach(j = i+1:(n-i), .packages = c("stringdist","tibble","dplyr"), .combine = 'rbind') %dopar% {

  initial_term <- fruits[i]
  compared_term <- fruits[j]
  score <- stringsim(initial_term,compared_term, method = 'lv')
  tibble(x=initial_term, y=compared_term, score=score) # value of the block: one row per pair
  
  }

stopImplicitCluster()

The result has the expected number of rows (3,160 = 80 × 79 / 2):

> score_df
# A tibble: 3,160 x 3
   x     y             score
   <chr> <chr>         <dbl>
 1 apple apricot      0.286 
 2 apple avocado      0.143 
 3 apple banana       0.167 
 4 apple bell pepper  0.273 
 5 apple bilberry     0.125 
 6 apple blackberry   0.200 
 7 apple blackcurrant 0.0833
 8 apple blood orange 0.0833
 9 apple blueberry    0.111 
10 apple boysenberry  0.0909
# ... with 3,150 more rows

References:

parallel

https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html

foreach

https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html

Nested foreach

https://cran.r-project.org/web/packages/foreach/vignettes/nested.html

  • `foreach(j = i+1:(n-i)` should be `foreach(j = (i+1):(n-i)`. – Rui Barradas Jul 26 '21 at 16:38
  • Thank you, but I don't think the `(i+1)` syntax is right, as it would make each element start the comparison from the beginning of the list, i.e. `x->y` and `y->x`. We want to compare each pair only once, without the transposed duplicate, to reduce the rows. – Mohamed Mostafa El-Sayyad Jul 29 '21 at 05:16
  • No it shouldn't. It will start the comparisons from the element *after* element `i`. The way in my answer below will not compare twice, try it and see. – Rui Barradas Jul 29 '21 at 05:33
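
The disagreement in these comments comes down to operator precedence. In R, `:` binds tighter than `+`, so `i+1:(n-i)` is read as `i + (1:(n-i))`. A small sketch with illustrative values (`i = 2`, `n = 5`, chosen here only for demonstration) shows which spellings cover the elements after position `i`:

```r
i <- 2L
n <- 5L

a <- i + 1:(n - i)          # parsed as i + (1:(n - i)) -> 3 4 5
b <- (i + 1):n              # elements after position i -> 3 4 5
c_short <- (i + 1):(n - i)  # stops too early           -> 3

print(a)
print(b)
print(c_short)
print(isTRUE(all.equal(a, b)))  # the first two spellings agree
```

So the question's original bounds and the answer's `(i+1):n` visit the same pairs, each exactly once; `(i+1):n` is simply easier to read.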

1 Answer

Here are some ideas:

  • Work with the character vector of strings, not with a much slower data.frame;
  • have the inner loop return a named numeric vector instead of the, once again, much slower data.frame;
  • there is no need to create two variables in the inner loop; pass the strings to be compared directly to stringsim.

This will return a matrix, not a data.frame. And matrices have faster element access times.
The code will become

library(tidyverse)
library(parallel)
library(foreach)
library(doParallel)

ncores <- detectCores()
registerDoParallel(ncores - 1L) # leave one logical core free

fruit <- fruits[["value"]] # plain character vector (`fruits` is the data frame from the Data section)
n <- nrow(fruits)
score_df <- foreach(i = 1:(n-1), .combine = 'rbind') %:%
  foreach(j = (i+1):n, .packages = c("stringdist","tibble","dplyr"), .combine = 'rbind') %dopar% {
    score <- stringsim(fruit[i], fruit[j], method = 'lv')
    c(initial = i, compared = j, score = score) # named numeric vector; rbind stacks these into a matrix
  }
stopImplicitCluster()

score_df
#         initial compared      score
#result.1       1        2 0.28571429
#result.2       1        3 0.14285714
#result.3       1        4 0.16666667
#result.4       1        5 0.27272727
#result.1       2        3 0.14285714
#result.2       2        4 0.00000000
#result.3       2        5 0.09090909
#result.1       3        4 0.14285714
#result.2       3        5 0.00000000
#result.4       4        5 0.09090909

class(score_df)
#[1] "matrix" "array" 

Note

You should create a cluster explicitly. I haven't because that would depend on your unstated operating system.
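
A minimal sketch of what creating the cluster explicitly could look like, assuming a socket (PSOCK) cluster from `parallel::makeCluster()`, which is available on every operating system. The worker count of 2 and the name `score_mat` are illustrative choices, not part of the code above:

```r
library(parallel)
library(foreach)
library(doParallel)

fruit <- c("apple", "apricot", "avocado", "banana", "bell pepper")
n <- length(fruit)

cl <- makeCluster(2L)      # explicit PSOCK cluster; 2 workers is an arbitrary choice
registerDoParallel(cl)     # tell foreach to use this cluster

score_mat <- tryCatch(
  foreach(i = 1:(n - 1), .packages = "stringdist", .combine = "rbind") %dopar% {
    score <- stringsim(fruit[i], fruit[(i + 1):n], method = "lv")
    cbind(initial = i, compared = (i + 1):n, score = score)
  },
  finally = stopCluster(cl)  # always release the workers, even on error
)

print(score_mat)
```

Holding on to `cl` makes the lifetime of the workers visible, and `stopCluster(cl)` replaces `stopImplicitCluster()`.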

Edit

The function stringsim is vectorized, so there is no need for a nested loop: the inner loop can be handled by the function itself.

ncores <- detectCores()
registerDoParallel(ncores - 1L)

score_df2 <- foreach(i = 1:(n - 1),
                     .packages = "stringdist", 
                     .combine = "rbind") %dopar% {
    score <- stringdist::stringsim(fruit[i], fruit[(i + 1):n], method = 'lv')
    cbind(initial = i, compared = (i+1):n, score = score)
  }

stopImplicitCluster()

score_df2
#      initial compared      score
# [1,]       1        2 0.28571429
# [2,]       1        3 0.14285714
# [3,]       1        4 0.16666667
# [4,]       1        5 0.27272727
# [5,]       2        3 0.14285714
# [6,]       2        4 0.00000000
# [7,]       2        5 0.09090909
# [8,]       3        4 0.14285714
# [9,]       3        5 0.00000000
#[10,]       4        5 0.09090909
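
Taking the vectorization one step further, the outer loop can be dropped as well by building both index vectors up front and calling stringsim once. This is only a sketch under the assumption that all index pairs fit in memory (for ~7k strings that is about 24.5 million pairs); the name `score_df3` is hypothetical and continues the numbering used above.

```r
library(stringdist)

fruit <- c("apple", "apricot", "avocado", "banana", "bell pepper")
n <- length(fruit)

# Index vectors for every unordered pair (i < j), built without a loop
i <- rep(seq_len(n - 1), times = (n - 1):1)  # 1 1 1 1 2 2 2 3 3 4
j <- i + sequence((n - 1):1)                 # 2 3 4 5 3 4 5 4 5 5

# One vectorized call over all pairs
score_df3 <- data.frame(
  initial  = i,
  compared = j,
  score    = stringsim(fruit[i], fruit[j], method = "lv")
)

print(score_df3)
```

Whether this beats the foreach version on a given machine would need to be measured. It avoids the cost of shipping work to separate R processes but holds all pairs in memory at once.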

Data

txt <- "value      
1 apple      
2 apricot    
3 avocado    
4 banana     
5 'bell pepper'"

tc <- textConnection(txt)
fruits <- read.table(tc, header = TRUE)
close(tc)
Rui Barradas
  • Thank you, it definitely improved the performance. I would like to add that removing the attached packages `dplyr, tidyverse` improved the performance as well. I ran it on ~6K rows and got the results after 54 minutes. – Mohamed Mostafa El-Sayyad Jul 29 '21 at 05:59
  • @MohamedMostafaEl-Sayyad Glad to help. And yes, there is no need for all the packages `tidyverse` loads, only `stringdist` is needed. – Rui Barradas Jul 29 '21 at 06:10