I'm new to R and I've been trying to fuzzy match two large datasets without crashing my computer. At first it took too long, so I split the data frame into a list and used purrr::map,
but it's still taking a long time and not working.
So now I'm splitting the names in both datasets into groups and then looping through them group by group.
Let's say I have two datasets.
list.a <- data.frame(name = c("aa", "bb", "cc", "dd", "ee", "ff", "gg"))
list.b <- data.frame(name = c("ab", "cb", "ff", "dd", "ee", "ff", "gg"))
I take the first character of each name and then split the data by that letter.
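library(dplyr)    # %>% and mutate()
library(stringr)  # str_sub()
# id_a holds the first letter of each name
list.a <- list.a %>%
  mutate(id_a = str_sub(name, 1, 1))
list.b <- list.b %>%
  mutate(id_a = str_sub(name, 1, 1))
# split each data frame into a named list, one element per first letter
list.a <- split(list.a, list.a$id_a)
list.b <- split(list.b, list.b$id_a)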
The split function gives me a list of data frames, one per first letter of name.
Here's the part that's troubling me, and I'm not sure what to do. I'm trying to fuzzy match letter by letter (names that start with "a" in both sets, then names that start with "b", and so on).
In other words, I want to fuzzy join by 'name' within each pair of list elements that share the same first letter in both datasets. This is what I tried:
purrr::map(list.a, ~stringdist_inner_join(x = ., y = list.b,
                                          by = "name",
                                          ignore_case = FALSE,
                                          method = "jw",
                                          max_dist = 0.25))
My expected output is that each pair of groups is joined by fuzzy matching, and the joined results are then combined into a single data frame at the end.
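To make the goal concrete, here is a minimal sketch of the per-letter join described above. It assumes the dplyr, stringr, purrr, and fuzzyjoin packages are installed; `common` and `result` are names made up for illustration. Unlike my attempt, it pairs each group in list.a with only the same-letter group in list.b (via map2) instead of passing the whole list.b as y, and it skips letters that appear in only one dataset. I'm not sure this is the best approach for large data.

```r
library(dplyr)
library(stringr)
library(purrr)
library(fuzzyjoin)

# Same toy data as above
list.a <- data.frame(name = c("aa", "bb", "cc", "dd", "ee", "ff", "gg"))
list.b <- data.frame(name = c("ab", "cb", "ff", "dd", "ee", "ff", "gg"))

# Add the first-letter key and split into named lists of data frames
list.a <- list.a %>% mutate(id_a = str_sub(name, 1, 1))
list.b <- list.b %>% mutate(id_a = str_sub(name, 1, 1))
list.a <- split(list.a, list.a$id_a)
list.b <- split(list.b, list.b$id_a)

# Letters present in both datasets (hypothetical helper name)
common <- intersect(names(list.a), names(list.b))

# Join each same-letter pair of groups, then stack the pieces together
result <- map2(
  list.a[common], list.b[common],
  ~ stringdist_inner_join(.x, .y,
                          by = "name",
                          ignore_case = FALSE,
                          method = "jw",
                          max_dist = 0.25)
) %>%
  bind_rows()

print(result)
```

One consequence of grouping by first letter is that names starting with different letters (for example "ab" and "bb") are never compared, so such pairs cannot appear in the result even if their distance is small.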
Thanks for any suggestions!