Dplyr: Clean double-barrelled surnames

Question

I have a data.frame of names such as the following which has a sample of some surnames followed by an initial (e.g. Smith S or Lopez-Garcia M):

df<-data.frame(names=c("Adu-Amankwah E",
"Smith Dawson E",
"Lopez-Garcia M",
"Lopez Garcia MA",
"Garcia MAC",
"Lopez Garcia MA",
"Garcia MAC"))

I would like to pull out all those names with a double-barreled surname and do a little cleaning:

pull out any with a hyphen (-) or two surnames (e.g. Lopez Garcia).
I need to replace any of the following: Lopez Garcia MA,Lopez-Garcia MA or Garcia MAC with Lopez-Garcia M. And Smith Dawson E should be Smith-Dawson E.

Output would look like:

df<-data.frame(names=c("Adu-Amankwah E",
"Smith-Dawson E",
"Lopez-Garcia M",
"Lopez-Garcia M",
"Lopez-Garcia M",
"Lopez-Garcia M",
"Lopez-Garcia M"))

Do you have just these few names or are the names just a sample of a large number of distinct names? — Chris Ruehlemann, Nov 23 '21 at 16:03
@ChrisRuehlemann this is a small sample, but this is the style I'm finding. — HCAI, Nov 23 '21 at 16:03
I see but what if you have one `Smith E` and one `Dawson E` and one single `Garcia` and so on? — Chris Ruehlemann, Nov 23 '21 at 16:05
The challenge here is not so much *parsing* the `character` strings as it is defining the *logic* to **(1)** associate variants of the same name (`"Garcia MAC"`, `"Lopez Garcia MA"`) under a representative label (`"Lopez-Garcia M"`); and still **(2)** avoid lumping together similar variants of different names (`"Andy Garcia"`). — Greg, Nov 23 '21 at 16:06
Frankly, you might just need a mapping table for several well-known variants of each name, which cover everything in your dataset with the lone exception of punctuation, which you can parse away. — Greg, Nov 23 '21 at 16:07
@Greg thank you, that makes sense. What would you "google" to find out how to do that? I've had a look at fuzzy_join which could be interesting: https://stackoverflow.com/questions/63894177/fuzzy-left-join-person-full-names-in-r-handling-tricky-edge-cases-cannot-inst. Need to evaluate how good it is though. — HCAI, Nov 23 '21 at 16:19
@HCAI You could define by hand a `data.frame` called `mapping` with two columns: `Rep_Name`, containing a single representative name (like `"Lopez-Garcia M"`) for each person; and `Pattern`, containing a [regex](https://rdrr.io/r/base/regex.html) (like `"^(Lopez[- ]?)?Garcia( M(AC?)?)?$"`) for each `Name` that covers all its variants. A [`fuzzyjoin::regex_join()`](https://rdrr.io/cran/fuzzyjoin/man/regex_join.html) should do the trick from there: `fuzzyjoin::regex_left_join(x = df, y = mapping, by = c("names" = "Pattern"))`. — Greg, Nov 23 '21 at 16:34
@HCAI Also, I might *not* recommend a [**`stringdist`**](https://rdrr.io/cran/stringdist/) approach in this situation. Each name varies not only in spelling but also in structure. It's entirely possible that two similarly structured entries, for two *different* people with a popular last name like (say) `"Garcia"`, have a shorter string distance than do two differently structured variations on the *same* name. — Greg, Nov 23 '21 at 16:43

Greg · Accepted Answer · 2021-11-23T22:40:15.940

As I mentioned in my comments, the challenge here is not so much parsing the character strings as it is defining the logic to

associate variants of the same name (ex. "Garcia MAC", "Lopez Garcia MA") under a representative label ("Lopez-Garcia M"); and still
avoid lumping together similar variants (like "Garcia A") of different names (like "Andy Garcia").

As such, your best approach might be to define a mapping table for known variations on the names.

Literal Mapping

A literal mapping involves typing out every known variant, next to the name it truly represents.

mapping_lit <- data.frame(
  True_Name = c("Adu-Amankwah E", "Smith-Dawson E", "Lopez-Garcia M", "Lopez-Garcia M",  "Lopez-Garcia M"),
  Variant   = c("Adu-Amankwah E", "Smith Dawson E", "Lopez-Garcia M", "Lopez Garcia MA", "Garcia MAC")
)

mapping_lit
#>        True_Name         Variant
#> 1 Adu-Amankwah E  Adu-Amankwah E
#> 2 Smith-Dawson E  Smith Dawson E
#> 3 Lopez-Garcia M  Lopez-Garcia M
#> 4 Lopez-Garcia M Lopez Garcia MA
#> 5 Lopez-Garcia M      Garcia MAC

Once you have your mapping, a simple dplyr::*_join() should do the trick

library(dplyr)

# The LEFT JOIN preserves any names without matches, so you can handle them as you wish.
left_join(
  df,
  mapping_lit,
  by = c("names" = "Variant")
)

with the following result:

            names      True_Name
1  Adu-Amankwah E Adu-Amankwah E
2  Smith Dawson E Smith-Dawson E
3  Lopez-Garcia M Lopez-Garcia M
4 Lopez Garcia MA Lopez-Garcia M
5      Garcia MAC Lopez-Garcia M
6 Lopez Garcia MA Lopez-Garcia M
7      Garcia MAC Lopez-Garcia M

Regex Mapping

If you're skilled enough with regular expressions, you could just define one regex to represent all variants on each True_Name:

mapping_rgx <- data.frame(
  True_Name = c("Adu-Amankwah E",             "Smith-Dawson E",             "Lopez-Garcia M"),
  Pattern   = c("^(Adu[- ]?)?Amankwah( E)?$", "^(Smith[- ]?)?Dawson( E)?$", "^(Lopez[- ]?)?Garcia( M(AC?)?)?$")
)

mapping_rgx
#>        True_Name                          Pattern
#> 1 Adu-Amankwah E       ^(Adu[- ]?)?Amankwah( E)?$
#> 2 Smith-Dawson E       ^(Smith[- ]?)?Dawson( E)?$
#> 3 Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$

Once you have this mapping, you'll need a fuzzyjoin::regex_*_join() to match up the variants

library(fuzzyjoin)

# The LEFT JOIN preserves any names without matches, so you can handle them as you wish.
regex_left_join(
  df,
  mapping_rgx,
  by = c("names" = "Pattern"),
  # Account for typos in capitalization.
  ignore_case = TRUE
)

with the following result:

            names      True_Name                          Pattern
1  Adu-Amankwah E Adu-Amankwah E         (Adu[- ]?)?Amankwah( E)?
2  Smith Dawson E Smith-Dawson E         (Smith[- ]?)?Dawson( E)?
3  Lopez-Garcia M Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$
4 Lopez Garcia MA Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$
5      Garcia MAC Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$
6 Lopez Garcia MA Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$
7      Garcia MAC Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$

Warning

As I also commented, I might not recommend a stringdist approach in this situation. Each name varies not only in spelling but also in structure. It's entirely possible that two similarly structured entries for two different people

Variant	True_Name
Garcia A	Andy Garcia
Garcia MAC	Lopez-Garcia M
Lopez-Garcia M	Lopez-Garcia M

have a shorter string distance than do two differently structured variations on the same name:

# Run the full gamut of methods for 'stringdist::stringdist()'.
methods <- c(
  "osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex"
)


# Display string distances for variants of the same and of different names:
rbind(
  # Compare different names.
  sapply(X = methods, FUN = function(x) {stringdist::stringdist(
    a = "Garcia MAC", b = "Garcia A",
    method = x
  )}),
  # Compare variations on the same name.
  sapply(X = methods, FUN = function(x) {stringdist::stringdist(
    a = "Garcia MAC", b = "Lopez-Garcia M",
    method = x
  )})
)

#>      osa lv dl hamming lcs qgram     cosine   jaccard         jw soundex
#> [1,]   2  2  2     Inf   2     2 0.08712907 0.2222222 0.06666667       1
#> [2,]   8  8  8     Inf   8     8 0.27831216 0.5333333 0.20952381       1

Thank you very much, this is very comprehensive! I will send some bounty when it's available! — HCAI, Nov 23 '21 at 22:05

Dplyr: Clean double-barrelled surnames

1 Answers1

Literal Mapping

Regex Mapping

Warning