I'm computing pairwise string distances for 8 million observations across 17 columns.
Because I run into memory issues, I'm asking for help with a subsetting technique or another method to overcome them.
In a different question on this website, I asked for help to speed up the original code I'd written (based on yet another question). The resulting answer (thanks @alistaire) provided very useful help and increased the speed enormously. However, on the real data, I run out of memory very quickly using this approach.
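To put the scale in perspective, here is a back-of-the-envelope calculation of my own (assuming 8 bytes per stored numeric distance, which still understates the overhead of the tidied data frame):

n     <- 8e6                  # observations in the real data
pairs <- n * (n - 1) / 2      # number of unordered row pairs
pairs
# [1] 3.2e+13
pairs * 8 / 1024^4            # terabytes for a single numeric distance column
# [1] 232.8306

That is roughly 233 TB per distance column, so holding all pairs at once is hopeless on 128 GB of RAM, which is why I'm looking for a way to work in pieces.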
Consider the following test data with only three variables to compare:
df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))
When I run the following code, I get the desired output, and it is very fast:
library(purrr); library(tidyr)   # for map2()/map_df() and spread()
map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>%
  map_df(broom::tidy, .id = 'var') %>%
  spread(var, distance)
# item1 item2 names v1 v2
# 1 2 1 1 0.1111111 0.0000000
# 2 3 1 1 0.0000000 0.1111111
# 3 3 2 1 0.1111111 0.1111111
# 4 4 1 1 0.1111111 0.1111111
# 5 4 2 1 0.0000000 0.1111111
# 6 4 3 1 0.1111111 0.0000000
# 7 5 1 1 0.1111111 0.1111111
# 8 5 2 1 0.0000000 0.1111111
# 9 5 3 1 0.1111111 0.0000000
# 10 5 4 0 0.0000000 0.0000000
But when I run it on the real data, this approach runs out of memory. I would still like to use it, because it is really fast.
Are there any techniques or methods to apply this code to subsets of the 8-million-row data.frame such that every row still gets compared to every other row?
The system I'm working on has:
12 cores
128 GB RAM
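For what it's worth, the rough block-wise idea I have in mind looks like the sketch below. This is my own placeholder code, not from any package: block_pairs, block_size and max_dist are names I made up, it handles a single column and a single method only, and in practice each chunk would be filtered and written to disk rather than collected in a list.

library(stringdist)

block_pairs <- function(x, block_size = 50000, method = "jw", max_dist = 0.1) {
  # split the vector (and its original indices) into blocks of at most block_size
  grp    <- ceiling(seq_along(x) / block_size)
  blocks <- split(x, grp)
  idx    <- split(seq_along(x), grp)
  out <- list()
  for (i in seq_along(blocks)) {
    for (j in i:length(blocks)) {
      # cross-distance matrix between block i and block j only
      d <- stringdistmatrix(blocks[[i]], blocks[[j]], method = method)
      if (i == j) d[lower.tri(d, diag = TRUE)] <- NA   # count each pair only once
      keep <- which(d <= max_dist, arr.ind = TRUE)     # retain only close pairs
      if (nrow(keep) > 0) {
        out[[length(out) + 1]] <- data.frame(
          item1 = idx[[i]][keep[, 1]],   # original row of the first element
          item2 = idx[[j]][keep[, 2]],   # original row of the second element
          dist  = d[keep])
      }
    }
  }
  do.call(rbind, out)
}

# e.g. on the toy data this should return only the identical 'J BOND' pair (rows 4 and 5)
block_pairs(as.character(df$names), block_size = 2, method = "jw")

Only one block pair is in memory at a time, and since the (i, j) combinations are independent they could presumably be spread over the 12 cores with parallel::mclapply, with each result chunk appended to a file instead of kept in a list. Is something along these lines (or an existing package that does it better) the right way to go?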