I'm trying to join two datasets on based on the values of two variables. Both datasets have the same variable names/number of columns but may have a different number of rows. I want to join them based on a grouping variable ("SampleID") and a variable that contains text strings ("PrimConstruct"). I want to use fuzzy matching to account for slight misspellings and differences in capitalizations in "PrimConstruct". There's also the chance that one dataset has a row with a text value that the other dataset does not, in which case I would like it to leave that as a separate row.
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201",
"bcd0201"), PrimConstruct_a = c("cohesion", "cognition",
"cohesion", "cognition"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201",
"bcd0201"), PrimConstruct_b = c("cohesion", "cognition",
"commitment", "Cohesion", "cognitiion"))
# df2 has misspelling, different capitalization,
# and entry with no close match
I would like the fuzzy matching to only occur within the same SampleID
, so "cohesion" for abc0101 will not be matched with "cohesion" for bcd0201. I am eventually hoping to have a dataset that keeps both PrimConstruct columns and all the values, but the matched values are in the same row:
desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "NA"), PrimConstruct_a = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment")
I have tried stringdist_join
but it ends up duplicating and adding a bunch of rows somehow.
joined <- stringdist_join(df1,
df2,
by = c("PrimConstruct_a" = "PrimConstruct_b",
"SampleID_a" = "SampleID_b"),
mode = "full",
method = "jw",
max_dist = 2,
ignore_case = T)
I'm a little confused by the different matching methods, but I don't expect differences in text strings to be much more than minor spelling mistakes and differences in capitalization, which I believe should be covered by ignore_case.