
I have a data set of 40,000 rows x 4 columns and I need to compare each column to itself in order to find the closest result, i.e. the minimum Levenshtein distance. The idea is to get an "almost duplicate" for every row. I have calculated this with `adist`, but it seems too slow. For example, for only one column, comparing 5,000 rows against the whole 40,000-row column takes almost 2 hours. That is 8 hours for the 4 columns, and 32 hours for the entire dataset. Is there any faster way to achieve the same? I need it to run in 1 or 2 hours if possible. This is an example of what I have done so far:


    #vector example
    a<-as.character(c("hello","allo","hola"))
    b<-as.character(c("hello","allo","hola"))
    
    #execution time
    start_time <- Sys.time()
    
    #Matrix with distance
    dist.name<-adist(a,b, partial = TRUE, ignore.case = TRUE)
    
    #time elapsed
    end_time <- Sys.time()
    end_time - start_time
    
    Output:
    Time difference of 5.873202 secs
    
    #result
    dist.name
          [,1] [,2] [,3]
    [1,]    0    4    5
    [2,]    2    0    2
    [3,]    5    4    0

Desired output (the minimum distance for every row, but not against the row itself), but I need it faster:

    [1,] 4
    [2,] 2
    [3,] 4
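
For clarity, the desired output comes from the distance matrix by blanking out the diagonal (so a row is never matched against itself) and taking the row-wise minimum. A minimal sketch using the `dist.name` matrix from the example above (the slow part is building the matrix, not this step):

    d <- dist.name
    diag(d) <- NA                    # never match a row against itself
    apply(d, 1, min, na.rm = TRUE)   # minimum distance per row
    #[1] 4 2 4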
ecp
  • Don't know if it applies to your case, but if you know there are some exact matches, I would get rid of those first by doing `which(a%in%b)` and then run the code for Levenshtein distances on the rest – boski May 10 '19 at 07:07
  • It might be worth checking out `fuzzyjoin`: using `fuzzyjoin::stringdist_inner_join(df, df)` you can match each row in `df` with its closest neighbours based on one or more columns. It uses `stringdist` to do the actual distance calculation, so Humpelstielzchen's answer is definitely the place to start; see the sketch after these comments. – Marius May 10 '19 at 07:24
  • And what happens if there is more than one result with the same distance? – ecp May 10 '19 at 07:26
  • It would match the row with both neighbours at first; you'd have to figure out a strategy to filter down to just one match per row. – Marius May 10 '19 at 07:27
  • I will try it; this may even give the performance I need. Does fuzzyjoin have any parameter so that results must be the minimum distance but greater than 0, in order to avoid matching a row with itself? – ecp May 10 '19 at 07:38
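
A minimal sketch of the `fuzzyjoin` route suggested in the comments, joining a data frame to itself on one column. The names `df` and `name` are placeholders, and `max_dist` / `distance_col` are illustrative settings to adapt:

    library(dplyr)
    library(fuzzyjoin)

    df <- data.frame(id = 1:3, name = c("hello", "allo", "hola"),
                     stringsAsFactors = FALSE)

    # match every row with its near neighbours on the `name` column
    matches <- stringdist_inner_join(df, df,
                                     by = "name",
                                     method = "lv",         # Levenshtein distance
                                     max_dist = 3,          # keep only reasonably close pairs
                                     distance_col = "dist")

    # drop self-matches and keep the single closest neighbour per row
    matches %>%
      filter(id.x != id.y) %>%
      group_by(id.x) %>%
      slice_min(dist, n = 1, with_ties = FALSE) %>%
      ungroup()

On ties, `with_ties = FALSE` keeps an arbitrary single match; a deliberate tie-breaking rule would go there.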

2 Answers


You could try the stringdist package.

It's written in C, supports parallel processing, and offers various distance metrics, including the Levenshtein distance.

library(stringdist)

a<-as.character(c("hello","allo","hola"))
b<-as.character(c("hello","allo","hola"))

start_time <- Sys.time()
res <- stringdistmatrix(a, b, method = "lv")  # full Levenshtein distance matrix
end_time <- Sys.time()

> end_time - start_time
Time difference of 0.006981134 secs
> res
     [,1] [,2] [,3]
[1,]    0    2    3
[2,]    2    0    3
[3,]    3    3    0


diag(res) <- NA                          # a row should not match itself
apply(res, 1, FUN = min, na.rm = TRUE)   # minimum distance per row
[1] 2 2 3
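
To also recover which string is the closest match (the "almost duplicate" the question asks for), not just the distance, one option is `which.min` per row. A minimal sketch reusing `res` from above with the diagonal already set to `NA`:

nearest <- apply(res, 1, which.min)      # column index of the closest non-self match
data.frame(string  = a,
           closest = b[nearest],
           dist    = apply(res, 1, min, na.rm = TRUE))

For the full 40,000-row columns, `stringdistmatrix()` also takes an `nthread` argument to spread the work over several cores, and you may want to process the rows in chunks so the 40,000 x 40,000 matrix never has to sit in memory at once.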
Humpelstielzchen
  • This really improves performance. How can I pick the minimum result per row and its associated string, excluding the same value? – ecp May 10 '19 at 07:24
  • Excellent -- I was going to suggest the OP write an Rcpp block, but this saves him a lot of work. – Carl Witthoft May 10 '19 at 11:36
  • Once I get the matrix, how can I proceed to sort for every row or column? My dataframe is 134781 obs. of 176 variables, and colnames() and rownames() are in the form of V**. – marine8115 Apr 21 '20 at 21:09
  • Why would you want to sort a distance matrix? – Humpelstielzchen Apr 22 '20 at 05:14

I have written an R package, zoomerjoin, which allows you to fuzzily join large datasets without having to compare all pairs of rows between the two dataframes. This means you can merge moderately sized (millions of rows) dataframes in seconds or minutes on a modern data-science laptop without running out of memory.

Here's how I would use the package to join these dataframes:

devtools::install_github("beniaminogreen/zoomerjoin")
library(zoomerjoin)

a<-data.frame(string = c("hello","allo","hola"), id_1 = 1:3)
b<-data.frame(string = c("hello","allo","hola"), id_2 = 1:3)

jaccard_inner_join(a,b)
#   string.x id_1 string.y id_2
# 1     allo    2     allo    2
# 2     hola    3     hola    3
# 3    hello    1    hello    1

This would give you a dataframe of close pairs, on which you could then use stringdist to find the single closest match for each row, if I understand your problem correctly.
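
For instance, a minimal sketch of that post-processing step, assuming the column names shown in the output above (with this toy data only exact pairs survive the join, but on a real dataset the same pattern keeps the closest non-identical match per row):

library(stringdist)

pairs <- jaccard_inner_join(a, b)

# drop rows matched with themselves, then score the remaining pairs
pairs <- pairs[pairs$id_1 != pairs$id_2, ]
pairs$dist <- stringdist(pairs$string.x, pairs$string.y, method = "lv")

# keep the single closest candidate for each left-hand row
pairs <- pairs[order(pairs$id_1, pairs$dist), ]
best  <- pairs[!duplicated(pairs$id_1), ]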

I've used the package to fuzzy-join datasets with hundreds of millions of rows in a matter of minutes, so it should be able to make quick work of a data frame with 40k observations.