
I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks?

Say I have the following data.frame:

df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"), 
                      v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"), 
                      v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))

I want to compare each pair of rows in df on their Jaro-Winkler similarity.

With some help from others (see this post), I've been able to construct this code:

library(stringdist)

#columns to compare 
testCols <- c("names", "v1", "v2")

#compare pairs
RowCompare <- function(x){
  comp <- NULL
  pairs <- t(combn(nrow(x), 2))
  for(i in 1:nrow(pairs)){
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a", "row_b", "names_j", "v1_j", "v2_j")
  return(comp)
}

#define TestsCompare
TestsCompare <- function(x, y){
  names_j <- stringdist(x$names, y$names, method = "jw")
  v1_j <- stringdist(x$v1, y$v1, method = "jw")
  v2_j <- stringdist(x$v2, y$v2, method = "jw")
  c(names_j, v1_j, v2_j)
}

This generates the correct output:

output = as.data.frame(RowCompare(df))

> output
   row_a row_b   names_j      v1_j      v2_j
1      1     2 0.4444444 0.1111111 0.0000000
2      1     3 0.3571429 0.0000000 0.1111111
3      1     4 0.4444444 0.1111111 0.1111111
4      1     5 0.4444444 0.1111111 0.1111111  
5      2     3 0.4603175 0.1111111 0.1111111
6      2     4 0.3333333 0.0000000 0.1111111
7      2     5 0.3333333 0.0000000 0.1111111
8      3     4 0.5634921 0.1111111 0.0000000
9      3     5 0.5634921 0.1111111 0.0000000
10     4     5 0.0000000 0.0000000 0.0000000

However, my real data.frame has 8 million observations, and I make 17 comparisons. Running this code takes days...

I am looking for ways to speed up this process:

  • Should I use matrices instead of data.frames?
  • How to parallelize this process?
  • Vectorize?
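On the vectorization point, here is a rough sketch (assuming the `df` and `testCols` defined above). `stringdist::stringdist` is vectorized over its two character arguments, so you can precompute all pair indices once with `combn` and make one `stringdist` call per column, instead of one call per pair and a growing `rbind`:

```r
library(stringdist)

# all unordered pairs of row indices, computed once: 10 pairs for 5 rows
pairs <- t(combn(nrow(df), 2))

out <- data.frame(row_a = pairs[, 1], row_b = pairs[, 2])
for (col in testCols) {
  # one vectorized call per column; as.character() in case columns are factors
  out[[paste0(col, "_j")]] <- stringdist(as.character(df[[col]][pairs[, 1]]),
                                         as.character(df[[col]][pairs[, 2]]),
                                         method = "jw")
}
```

This produces the same 10-row table as the loop above, but the number of `stringdist` calls is the number of columns, not the number of pairs.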
    Honestly, it's a lot simpler to compare each column independently: `lapply(df, stringdist::stringdistmatrix, method = 'jw')`. [stringdist has some parallelization built-in, as well,](https://www.rdocumentation.org/packages/stringdist/versions/0.9.4.4/topics/stringdist-parallelization) though you'll need to make sure everything is configured. – alistaire Feb 18 '17 at 22:45
    To extend that approach to get back to what you have, `library(tidyverse); df %>% map(stringdist::stringdistmatrix, method = 'jw') %>% map_df(broom::tidy, .id = 'var') %>% spread(var, distance)`. I also recently ran across [multidplyr](http://www.business-science.io/code-tools/2016/12/18/multidplyr.html), though there are lots of ways to parallelize. – alistaire Feb 18 '17 at 22:56
  • @alistaire I like this! Thank you. Is there a way to differentiate between different methods? Say, an exact string match on v1 and a 'soundex' match on v2. Is that possible with `stringdist::stringdistmatrix` ? – wake_wake Feb 19 '17 at 01:22
    Sure, just use `purrr::map2` (in base, `Map`) and supply a vector of methods, e.g. `map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y))` – alistaire Feb 19 '17 at 01:28
  • Great!! How can we structure the output into a data.frame? The output from `purrr::map2` are lists, right? – wake_wake Feb 19 '17 at 01:40
  • Yep. The resulting structure is exactly the same as that from the `lapply` in my first comment, so you can add it to the pipeline just like the `map` in my second. – alistaire Feb 19 '17 at 01:46
  • @alistaire If I add `map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y))` to the pipeline, I get an error that 'Value column 'distance' does not exist in input'. Where is the value `distance` created? – wake_wake Feb 19 '17 at 02:12
    By `broom::tidy.dist`. Altogether, it would be `map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>% map_df(broom::tidy, .id = 'var') %>% spread(var, distance)`, which works fine for me. – alistaire Feb 19 '17 at 02:23
  • You may be interested in packages such as RecordLinkage. See "Statistical Matching and Record Linkage" section in "OfficialStatistics" CRAN task view – Nat May 14 '18 at 19:07

1 Answer


If you iterate over the variables you want to check, you can make a distance matrix for each with stringdist::stringdistmatrix. Using a form of lapply or purrr::map will return a list of distance matrices (one per column), which you can in turn iterate over to call broom::tidy, which turns each into a nicely formatted data.frame. If you use purrr::map_df with its .id parameter, the results will be coerced into one big data.frame, and the name of each list element will be added as a new column so you can keep them straight. The resulting data.frame will be in long form, so if you want it to match the results above, reshape with tidyr::spread.

If, as you mentioned in the comments, you want to use different methods for different variables, iterate in parallel with map2 or Map.

Altogether,

library(tidyverse)

map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>% 
    map_df(broom::tidy, .id = 'var') %>% 
    spread(var, distance)

##    item1 item2 names        v1        v2
## 1      2     1     1 0.1111111 0.0000000
## 2      3     1     1 0.0000000 0.1111111
## 3      3     2     1 0.1111111 0.1111111
## 4      4     1     1 0.1111111 0.1111111
## 5      4     2     1 0.0000000 0.1111111
## 6      4     3     1 0.1111111 0.0000000
## 7      5     1     1 0.1111111 0.1111111
## 8      5     2     1 0.0000000 0.1111111
## 9      5     3     1 0.1111111 0.0000000
## 10     5     4     0 0.0000000 0.0000000

Note that while choose(5, 2) returns 10 observations, choose(8000000, 2) returns 3.2e+13 (32 trillion) observations, so for practical purposes, even though this will work much more quickly than your existing code (and stringdistmatrix does some parallelization when possible), the data will get prohibitively big unless you are only working on subsets.
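The pair counts above can be checked directly; the number of unordered pairs grows quadratically with the number of rows:

```r
# choose(n, 2) = n * (n - 1) / 2 unordered pairs
choose(5, 2)    # 10 pairs for the toy data
choose(8e6, 2)  # 3.1999996e+13, i.e. ~32 trillion pairs for 8 million rows
```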

alistaire
  • Right. Memory becomes an issue. What subset approach would you take that makes sure each pair in the original data.frame gets compared? – wake_wake Feb 21 '17 at 00:02
    You can't, unless you've got some spectacular hardware. The next step forward depends a lot on the particulars. Maybe there's a small handful of strings you care whether each observation matches; maybe you're looking for inherent groupings and should use some form of cluster analysis. Some browsing over at [CrossValidated](http://stats.stackexchange.com/) may help lead you down a fruitful path. – alistaire Feb 21 '17 at 01:13