I'm computing pairwise string distances for 8 million observations across 17 columns.
Because I run into memory issues, I'm asking for help with a subsetting technique or another method to overcome them.
In a different question on this website, I asked for help to speed up the original code I'd written (based on yet another question). The resulting answer (thanks @alistaire) provided very useful help and increased the speed enormously. However, on the real data, I run out of memory very quickly using this approach.
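To put the scale in perspective, here is a back-of-the-envelope calculation of my own (assuming 8 bytes per stored numeric distance, which still understates the overhead of the tidied data frame):

n     <- 8e6                  # observations in the real data
pairs <- n * (n - 1) / 2      # number of unordered row pairs
pairs
# [1] 3.2e+13
pairs * 8 / 1024^4            # terabytes for a single numeric distance column
# [1] 232.8306

That is roughly 233 TB per distance column, so holding all pairs at once is hopeless on 128 GB of RAM, which is why I'm looking for a way to work in pieces.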
Consider the following test data with only three variables to compare:
df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))
When I run the following code, I get the desired output, and it is very fast:
library(purrr); library(tidyr)   # for map2()/map_df() and spread()
map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>%
  map_df(broom::tidy, .id = 'var') %>%
  spread(var, distance)
# item1 item2 names v1 v2
# 1 2 1 1 0.1111111 0.0000000
# 2 3 1 1 0.0000000 0.1111111
# 3 3 2 1 0.1111111 0.1111111
# 4 4 1 1 0.1111111 0.1111111
# 5 4 2 1 0.0000000 0.1111111
# 6 4 3 1 0.1111111 0.0000000
# 7 5 1 1 0.1111111 0.1111111
# 8 5 2 1 0.0000000 0.1111111
# 9 5 3 1 0.1111111 0.0000000
# 10 5 4 0 0.0000000 0.0000000
But when I run it on the real data, this approach runs out of memory. I would still like to use it, because it is really fast.
Are there any techniques or methods to apply this code to subsets of the 8-million-row data.frame such that every row still gets compared to every other row?
The system I'm working on has:
12 cores
128 GB RAM
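For what it's worth, the rough block-wise idea I have in mind looks like the sketch below. This is my own placeholder code, not from any package: block_pairs, block_size and max_dist are names I made up, it handles a single column and a single method only, and in practice each chunk would be filtered and written to disk rather than collected in a list.

library(stringdist)

block_pairs <- function(x, block_size = 50000, method = "jw", max_dist = 0.1) {
  # split the vector (and its original indices) into blocks of at most block_size
  grp    <- ceiling(seq_along(x) / block_size)
  blocks <- split(x, grp)
  idx    <- split(seq_along(x), grp)
  out <- list()
  for (i in seq_along(blocks)) {
    for (j in i:length(blocks)) {
      # cross-distance matrix between block i and block j only
      d <- stringdistmatrix(blocks[[i]], blocks[[j]], method = method)
      if (i == j) d[lower.tri(d, diag = TRUE)] <- NA   # count each pair only once
      keep <- which(d <= max_dist, arr.ind = TRUE)     # retain only close pairs
      if (nrow(keep) > 0) {
        out[[length(out) + 1]] <- data.frame(
          item1 = idx[[i]][keep[, 1]],   # original row of the first element
          item2 = idx[[j]][keep[, 2]],   # original row of the second element
          dist  = d[keep])
      }
    }
  }
  do.call(rbind, out)
}

# e.g. on the toy data this should return only the identical 'J BOND' pair (rows 4 and 5)
block_pairs(as.character(df$names), block_size = 2, method = "jw")

Only one block pair is in memory at a time, and since the (i, j) combinations are independent they could presumably be spread over the 12 cores with parallel::mclapply, with each result chunk appended to a file instead of kept in a list. Is something along these lines (or an existing package that does it better) the right way to go?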