2

I'm looking to join two dataframes based on a condition, in this case, that one string is inside another. Say I have two dataframes,

df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"), 
                  ages = c(30, 51, 45, 38, 20))

       fullnames ages
1       Jane Doe   30
2 Mr. John Smith   51
3 Nate Cox, Esq.   45
4   Bill Lee III   38
5 Ms. Kate Smith   20

df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"), 
                  ages=c(30, 45, 20, 28, 51, 38), 
                  homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
  lastnames ages homestate
1       Doe   30        NJ
2       Cox   45        CT
3     Smith   20        MA
4      Jung   28        RI
5     Smith   51        MA
6       Lee   38        NY

I want to do a left join on these two dataframes on ages and the row in which df2$lastnames is contained within df1$fullnames. I thought fuzzy_join might do it, but I don't think it liked my grepl:

joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"), 
+                          match_fun = c("=", "grepl()"),
+                          mode="left")
Error in which(m) : argument to 'which' is not logical

Desired result: a dataframe identical to the first but with a "homestate" column appended. Any ideas?

Jess CT
  • 23
  • 3

2 Answers2

4

TLDR

You just need to fix match_fun:

# ...
match_fun = list(`==`, stringr::str_detect),
# ...

Background

You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a

Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.

Solution

A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:

library(dplyr)

# ...
# Existing code
# ...

joined_dfs <- fuzzy_join(
  df1, df2,

  by        =       c("ages", "fullnames" = "lastnames"),
  #                   |----|  |-----------------------|
  match_fun =    list(`==`  , stringr::str_detect      ),
  #                   |--|    |-----------------|
  #   Match by equality ^      ^ Match by detection of `lastnames` in `fullnames`    

  mode = "left"
) %>%
  # Format resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)

Result

Given your sample data reproduced here

df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages = c(30, 51, 45, 38, 20)
)

df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

this solution should produce the following data.frame for joined_dfs, formatted as requested:

        fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA

Note

Because each ages is coincidentally a unique key, the following join on only *names

fuzzy_join(
  df1, df2,
  by = c("fullnames" = "lastnames"),
  match_fun = stringr::str_detect,
  mode = "left"
)

will better illustrate the behavior of matching on substrings:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA

Where You Went Wrong

Error in Type

The value passed to match_fun should be either (the symbol for) a function

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = grepl
  # ...
)

or a list of such (symbols for) functions:

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = list(`=`, grepl)
  # ...
)

Instead of providing a list of symbols

match_fun = list(=, grepl)

you incorrectly provided a vector of character strings:

match_fun = c("=", "grepl()")

Error in Syntax

The user should name the functions

`=`
grepl

yet you incorrectly attempted to call them:

=
grepl()

Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.

* Assuming the calls didn't fail with errors. Here, they would fail.

Inappropriate Functions

To compare two values for equality, here the character vectors df1$fullnames and df2$lastnames, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.

Furthermore grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

its first argument (pattern) is (treated as) a single character string:

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

Thus, grepl() is not a

Vectorized function given two columns...

but rather a function given one string (scalar) and one column (vector) of strings.

The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is

Vectorised over string and pattern. Equivalent to grepl(pattern, x).

and which wraps stringi::stri_detect().

Note

Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Now your df2$lastnames column is statistically unlikely to contain names with special regex characters; with the lone exception of -, which is interpreted literally outside of [], which are very unlikely to be found in a name.

If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.

Greg
  • 3,054
  • 6
  • 27
  • This is great, thank you! I couldn't find many examples of fuzzy_join and wasn't at all clear on the match_fun. – Jess CT Oct 15 '21 at 18:09
  • @JessCT Great to hear! If this solved your problem, kindly upvote it and mark it as "accepted". :) – Greg Oct 15 '21 at 18:10
  • And THANK YOU for this long, detailed, informative response! – Jess CT Oct 15 '21 at 18:17
  • It's telling me I don't have enough reputation to upvote, so I'm not sure how to mark it as accepted? If one more person upvotes my question I should have enough "reputation". – Jess CT Oct 15 '21 at 18:19
  • Thanks Jess, my pleasure! – Greg Oct 15 '21 at 18:29
  • As much as I like this solution it has its faults: e.g., if one a of the `fullnames` is `Smithsonian`it will be joined with `Smith`! I guess to prevent that `\\b` anchors will have to be built into the code somewhere! – Chris Ruehlemann Oct 15 '21 at 19:19
  • Thanks for the heads up, @ChrisRuehlemann. I'll look into it! – Greg Oct 15 '21 at 19:41
  • @ChrisRuehlemann Technically, the answer currently fulfills OP's stipulation of the _" condition...that one string is inside another"_. However, one could always transform `lastnames` into a sufficient regex via `mutate(lastnames = paste0(lastnames, "$"))`, to enforce that the `fullname` end with the `lastname`. – Greg Oct 15 '21 at 19:57
  • No that won't work as the `fullnames`do **not** always end with the `lastnames`! – Chris Ruehlemann Oct 16 '21 at 09:46
  • @ChrisRuehlemann Good catch! You did properly account for cases like those in your solution, so credit to you. I think OP will need to decide which criterion is better for their purposes: **(1)** that `lastname` be contained in `fullname`; or **(2)** that `fullname` be reducible to `lastname` by certain heuristics. Either way, with (say) `"Kiefer William Frederick Dempsey George Rufus Sutherland"`, it's nearly impossible to implement logic that will pick out **(1)** whether or not the person has multiple first names, multiple middle names, and multiple last names; and **(2)** which are which. – Greg Oct 16 '21 at 16:11
2

This seems to work given your two dataframes:

Edited as per comment by @Greg:

The code is adpated to the data as posted; if in your actual data, there are more variants expecially to last names, such as not only III but also IV, feel free to adapt the code accordingly:

library(dplyr)
df1 %>%
  mutate(
    # create new column that gets rid of strings after last name:
    lastnames = sub("\\sI{1,3}$|,.+$", "", fullnames),
    # grab last names:
    lastnames = sub(".*?(\\w+)$", "\\1", lastnames)) %>%
  # join the two dataframes:
  left_join(., df2, by = c("lastnames", "ages"))
       fullnames ages lastnames homestate
1       Jane Doe   30       Doe        NJ
2 Mr. John Smith   51     Smith        MA
3 Nate Cox, Esq.   45       Cox        CT
4   Bill Lee III   38       Lee        NY
5 Ms. Kate Smith   20     Smith        MA

If you want lastnamesremoved just append this after %>%:

select(-lastnames) 

EDIT #2:

If you don't trust the above solution given massive variation in how last names are actually noted, then of course fuzzy_join is an option too. BUT, the current fuzzy_join solution is not enough; it needs to be amended by one critical data transformation. This is because str_detect detects whether a string is contained within another string. That is, it will return TRUE if it compares, for example, Smith to Smithsonian or to Hammer-Smith - each time the string Smith is indeed contained in the longer names. If, as will likely be the case in a large dataset, Smith and Smithsonian happen to have the same ages the mismatch will be perfect: fuzzy_join will incorrectly join the two. The same problem arises when you have, e.g., Smith and Smith-Klein of the same age: there too fuzzy_join will join them.

The first set of problematic cases can be resolved by including word boundary achors \\b in df2. These assert that, for example, Smith must be bounded by word boundaries to either side, which is not the case with Smithsonian, which does have an invisible boundary to the left of Smithsonian but the right-hand anchor is after its last letter n. The second set of problematic cases can be addressed by including a negative lookahead after \\b, namely \\b(?!-), which asserts that after the word boundary there must not be a hyphen.

The solution is easily implemented with mutate and paste0 like so:

fuzzy_join(
  df1, df2 %>%
    mutate(lastnames = paste0("\\b", lastnames, "\\b(?!-)")),
  by  = c("ages", "fullnames" = "lastnames"),
  match_fun = list(`==`, str_detect),
  mode = "left"
) %>%
  select(fullnames, ages = ages.x, homestate) 
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
  • OP is using these two criteria for a match: **(1)** that the `ages` are identical; and **(2)** that the `lastname` is a substring of the `fullname`. Indeed, `ages` and `lastnames` serve as a "key" to uniquely identify otherwise ambiguous rows like `"Smith"`. I suspect OP deliberately chose `"Mr. John Smith"` and `"Ms. Kate Smith"` to illustrate this point. However, this answer `*_join`s on the names alone, and as such, it includes extra, inaccurate rows for the `"Smith"`s. _"Desired result: a dataframe identical to the first [`df1`] but with a "homestate" column appended."_ – Greg Oct 15 '21 at 14:10
  • Not to be a critic, but OP indicated `df1` and `df2` were merely illustrations of richer datasets: _"**Say** I have two dataframes"_. Yet this answer makes inextensible assumptions about last names, especially that none are [double-barreled](https://en.wikipedia.org/wiki/Double-barrelled_name). Notice how your last `mutate()` would fail in that case: `sub(".*?(\\w+)$", "\\1", c("Sacha Baron Cohen", "Catherine Zeta-Jones"))` gives us `c("Cohen", "Jones")`. It really accounts for only a few possible suffixes: `"I"`, `"II"`, `"III"`, or anything preceded by a comma; so `"IV"` would be overlooked – Greg Oct 15 '21 at 14:55
  • I appreciate your level of detail and rigor of observation. But why not let OP judge? – Chris Ruehlemann Oct 15 '21 at 15:05
  • Oh naturally, it is up to OP to determine which answer best addresses their question! My second comment was more of a musing on extensibility, and it's hardly a "certainty" of any sort. Apologies if I came across harshly. As you can see from my answer, I'm quite verbose, and it's hard to cram both content and accurate tone into 600 characters or less; so I appreciate your polite response. – Greg Oct 15 '21 at 15:12
  • 1
    As for my first comment, I do think it still stands: the resulting output is not what OP requested, due to a missing `*_join` criterion. However, that can easily be fixed by `left_join`ing on `by = c("lastnames", "ages")`. :) – Greg Oct 15 '21 at 15:15
  • 1
    I've edited my solution. Thanks for the tip! – Chris Ruehlemann Oct 15 '21 at 15:48
  • 1
    Happy to help, and I appreciate the shout out! :) – Greg Oct 15 '21 at 15:57
  • Yes, I like this solution too, but unfortunately my real data is crazy. It's government representatives and in one dataset it's written "John H Smith, D-NJ" (but not consistently, sometimes written out differently with no or different punctuation), and the other dataset is just last name, and it's 10k rows. – Jess CT Oct 15 '21 at 18:16
  • I understand. But I guess it would be possible to develop rules for the name variation and to come up with regex patterns that follow these rules. – Chris Ruehlemann Oct 15 '21 at 18:39
  • Please see my updated solution, which corrects a critical omission in the `fuzzy_join` solution. – Chris Ruehlemann Oct 16 '21 at 07:55
  • @ChrisRuehlemann Great job integrating our contributions into a more complete solution! – Greg Oct 16 '21 at 16:20