
I'm new to R and I've been trying to fuzzy match two large datasets without crashing my computer. The first attempt took far too long, so I split the data frame into a list and used purrr::map, but it still takes a long time and doesn't work.

So now I split the names in both datasets by first letter and loop through the resulting lists one by one.

Let's say I have two datasets.

library(tidyverse)
library(fuzzyjoin)
list.a <- data.frame(name = c("aa", "bb", "cc", "dd", "ee", "ff", "gg"))
list.b <- data.frame(name = c("ab", "cb", "ff", "dd", "ee", "ff", "gg"))

I take the first character of each name and then split the data by that letter.

# id_a holds the first letter of each name
list.a <- list.a %>%
  mutate(id_a = str_sub(name, 1, 1))

list.b <- list.b %>%
  mutate(id_a = str_sub(name, 1, 1))

# one data frame per first letter
list.a <- split(list.a, list.a$id_a)
list.b <- split(list.b, list.b$id_a)

split() gives me a list of data frames, one for each first letter of name.
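For illustration, the structure that split() returns can be inspected as below. This is a small sketch using base R only; `groups.a` is a name made up for this example so that the original `list.a` is not overwritten.

```r
# Sample data from the question
list.a <- data.frame(name = c("aa", "bb", "cc", "dd", "ee", "ff", "gg"))

# First letter of each name (base R equivalent of str_sub(name, 1, 1))
list.a$id_a <- substr(list.a$name, 1, 1)

# split() returns a named list with one data frame per distinct first letter
groups.a <- split(list.a, list.a$id_a)

names(groups.a)   # the first letters, used as list names
groups.a[["a"]]   # the rows whose name starts with "a"
```

The list names are what make it possible to pair up the groups from the two datasets later on.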

Here's the part that troubles me, and I'm not sure what to do. I want to fuzzy match letter by letter: names that start with "a" in both sets, then names that start with "b", and so on.

In other words, I want to fuzzy join by 'name' between the groups that start with the same letter in both datasets.

purrr::map(list.a, ~stringdist_inner_join(x = .,
                                          y = list.b,
                                          by = "name",
                                          ignore_case = FALSE,
                                          method = "jw",
                                          max_dist = 0.25))

My expected output: once each pair of groups has been joined by fuzzy matching, the results are combined into one data frame at the end.

Thanks for any suggestions!

Sun
  • What is your expected output? – Matt Nov 04 '21 at 01:58
  • Hi Matt, I just edited my post. Basically, I split both datasets by the first letter of name. Then I want to fuzzy join each pair of groups that start with the same letter. I want to do this to reduce run time and avoid errors. The output will then be the list of results appended into one complete data frame. – Sun Nov 04 '21 at 02:18

1 Answer

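I'm not sure what your expected output is, but here is a possible solution using Reduce. Unlike your example, it doesn't use split. Instead, it semi-joins on the first-letter column id_a, which keeps the rows of list.a whose first letter also appears in list.b.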

library(tidyverse)
library(fuzzyjoin)

list.a <- data.frame(name=c("aa","bb","cc","dd","ee","ff","gg") )
list.b <- data.frame(name=c("ab","cb","ff","dd","ee","ff","gg"))

list.a <- list.a %>%
  mutate(id_a=str_sub(name, 1,1))

list.b <- list.b %>%
  mutate(id_a=str_sub(name, 1,1))

# Keep the rows of df1 that have a match in df2 on id_a
join_dfs <- function(df1, df2){
  stringdist_semi_join(x = df1, 
                       y = df2,
                       by = "id_a",
                       ignore_case = FALSE,
                       method = "jw",
                       max_dist = 0.25)
}


# named dfs rather than all, to avoid masking base::all
dfs <- list(list.a, list.b)

Reduce(join_dfs, dfs)

This gives us:

  name id_a
1   aa    a
3   cc    c
4   dd    d
5   ee    e
6   ff    f
7   gg    g
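If the goal is instead the per-letter fuzzy join on name described in the question, a minimal sketch might look like the following. It assumes both datasets fit the sample's shape (a single name column) and that only letters present in both datasets need to be compared. The names `groups.a`, `groups.b`, `common`, `matched` and `result` are made up for this example.

```r
library(dplyr)
library(purrr)
library(stringr)
library(fuzzyjoin)

list.a <- data.frame(name = c("aa", "bb", "cc", "dd", "ee", "ff", "gg"))
list.b <- data.frame(name = c("ab", "cb", "ff", "dd", "ee", "ff", "gg"))

# Split each dataset into a named list keyed by the first letter of name
groups.a <- split(list.a, str_sub(list.a$name, 1, 1))
groups.b <- split(list.b, str_sub(list.b$name, 1, 1))

# Only letters that occur in both datasets can produce matches
common <- intersect(names(groups.a), names(groups.b))

# Fuzzy join each group of list.a with the group of list.b
# that has the same first letter
matched <- map2(
  groups.a[common],
  groups.b[common],
  ~ stringdist_inner_join(.x, .y,
                          by = "name",
                          ignore_case = FALSE,
                          method = "jw",
                          max_dist = 0.25)
)

# Stack the per-letter results into one data frame;
# first_letter records which group each row came from
result <- bind_rows(matched, .id = "first_letter")
result
```

Pairing the groups with map2 means each join only compares names that share a first letter, which is what keeps the number of string comparisons down. The trade-off is that names differing in their first character can never match.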
Matt