Joining dataframes on text strings using fuzzy string matching (stringdist_join())

Question

I'm trying to join two datasets on based on the values of two variables. Both datasets have the same variable names/number of columns but may have a different number of rows. I want to join them based on a grouping variable ("SampleID") and a variable that contains text strings ("PrimConstruct"). I want to use fuzzy matching to account for slight misspellings and differences in capitalizations in "PrimConstruct". There's also the chance that one dataset has a row with a text value that the other dataset does not, in which case I would like it to leave that as a separate row.

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                  "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                  "cohesion", "cognition")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                  "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                  "commitment", "Cohesion", "cognitiion")) 
# df2 has misspelling, different capitalization, 
# and entry with no close match

I would like the fuzzy matching to only occur within the same SampleID, so "cohesion" for abc0101 will not be matched with "cohesion" for bcd0201. I am eventually hoping to have a dataset that keeps both PrimConstruct columns and all the values, but the matched values are in the same row:

desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "NA"), PrimConstruct_a = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment")

I have tried stringdist_join but it ends up duplicating and adding a bunch of rows somehow.

joined <- stringdist_join(df1,
                         df2,
                         by = c("PrimConstruct_a" = "PrimConstruct_b",
                                "SampleID_a" = "SampleID_b"),
                         mode = "full",
                         method = "jw",
                         max_dist = 2,
                         ignore_case = T)

I'm a little confused by the different matching methods, but I don't expect differences in text strings to be much more than minor spelling mistakes and differences in capitalization, which I believe should be covered by ignore_case.

DPH · Answer 1 · 2023-03-07T16:08:47.930

string distance with multiple colum matches and distinct join function (in your case join equi and one fuzzy join) can be done like this with the fuzzyjoin package:

# dummy data ... I introduced a non matching case
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "proof"), 
                  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "00000")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"), 
                  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")) 

# build row identification
dplyr::mutate(df1, rn = dplyr::row_number()) %>%
    # perform join
    fuzzyjoin::fuzzy_left_join(df2,
                               # define join columns
                               by=c("SampleID_a" = "SampleID_b",
                                    "PrimConstruct_a" = "PrimConstruct_b"),
                               # list of match functions (first should be clear)
                               match_fun = list(`==`, 
                                                # function which returns boolean vector where maximum allowed string distance is 2 using levenshtein
                                                function(x,y) 
                                                    stringdist::stringdist(x, y, method="lv") < 2)
     )

  SampleID_a PrimConstruct_a rn SampleID_b PrimConstruct_b
1    abc0101        cohesion  1    abc0101        cohesion
2    abc0101       cognition  2    abc0101       cognition
3    bcd0201        cohesion  3    bcd0201        Cohesion
4    bcd0201       cognition  4    bcd0201      cognitiion
5      proof           00000  5       <NA>            <NA>

As you might have noticed the match can return more than one column as possibly there are more matches satisfying the condition. So you could work the data now group wise, which is why I introduced "rn", as you might want onyl the closest match. To show you how you can get down to this, I will leave an optional approach that does not use the fuzzyjoin but the stringdist packge (which fuzzyjoin is based one), as it might help you to understand the workings better:

# build new variable row number to identify df1 row uniuely
res <- dplyr::mutate(df1, rn = dplyr::row_number()) %>%
    # left join by the column that matches exactly (all from df1 and all matching from df2
    dplyr::left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
    # stringdist calculus and set result to 0 if NA (no left join match) for next step
    dplyr::mutate(dist = dplyr::coalesce(stringdist::stringdist(PrimConstruct_a,
                                                                PrimConstruct_b), 0)) %>%
    # build grouping by input df1 rows id
    dplyr::group_by(rn) %>%
    # get first row ordered by calculated dist column 
    # this is why dist was set 0 if NA before
    # with_ties false will return only one match even if there are two with the same distance
    dplyr::slice_min(order_by = dist, n = 1, with_ties = FALSE) %>% 
    # release grouping to prevent unwanted behaviour down stream
    dplyr::ungroup() %>%
    # I set dist back to NA where it was (you might already select or perform more calculations
    dplyr::mutate(dist = ifelse(is.na(PrimConstruct_b), NA, dist))

res
# A tibble: 5 × 5
  SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
  <chr>      <chr>           <int> <chr>           <dbl>
1 abc0101    cohesion            1 cohesion            0
2 abc0101    cognition           2 cognition           0
3 bcd0201    cohesion            3 Cohesion            1
4 bcd0201    cognition           4 cognitiion          1
5 proof      00000               5 NA                 NA

Depending on your use case there are some possible tweaks:

You could use tolower() or toupper() (result is the same) in the matching function of the fuzzyjoin or prior to the join or distance calculation on PrimConstruct_a and PrimConstruct_b, which will solve all stringdistances caused by upper vs lower case and leave only issues of letter diferences to be counted.
Read into the distinct stringdistance functions the stringdist and therefore fuzzyjoin package have implemented. There are some diference which might be helpfull depending on the usecase

EDIT

you are looking for something like a full join, though it is not that easy. One way to solve it is assigning the data from the prio step to a new variable (called it "res") and identify non used cases from df2 to union them back to the result like this:

mis <- df2 %>% 
    # through an antijoin we from df2 to the result we identify non matched cases
    dplyr::anti_join(res, by = c("SampleID_b" = "SampleID_a", 
                                 "PrimConstruct_b")) %>% 
    # format the data to have the same columns as res to be able to unionize/bind them
    dplyr::transmute(SampleID_a = SampleID_b, 
                     PrimConstruct_a = NA, 
                     rn = NA, 
                     PrimConstruct_b,
                     dist = NA)

mis
  SampleID_a PrimConstruct_a rn PrimConstruct_b dist
1    bcd0201              NA NA      commitment   NA


dplyr::union(res, mis)
# A tibble: 6 x 5
   SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
  <chr>      <chr>           <int> <chr>           <dbl>
1 abc0101    cohesion            1 cohesion            0
2 abc0101    cognition           2 cognition           0
3 bcd0201    cohesion            3 Cohesion            1
4 bcd0201    cognition           4 cognitiion          1
5 proof      00000               5 NA                 NA
6 bcd0201    NA                 NA commitment         NA

score 0 · Answer 2 · answered Mar 16 '23 at 22:36

You might use {powerjoin}:

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                                 "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                                                                 "cohesion", "cognition")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                                 "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                                                                 "commitment", "Cohesion", "cognitiion")) 

powerjoin::power_full_join(df1, df2, by = c(SampleID_a = "SampleID_b", ~ stringdist::stringdist(.x$PrimConstruct_a, .y$PrimConstruct_b) < 2))
#>   SampleID_a PrimConstruct_a PrimConstruct_b
#> 1    abc0101        cohesion        cohesion
#> 2    abc0101       cognition       cognition
#> 3    bcd0201        cohesion        Cohesion
#> 4    bcd0201       cognition      cognitiion
#> 5    bcd0201            <NA>      commitment

^{Created on 2023-03-16 with reprex v2.0.2}

Note that I renamed the second col of df2 to PrimConstruct_b

Joining dataframes on text strings using fuzzy string matching (stringdist_join())

2 Answers2