joining on inexact strings in R

Question

I am looking to join two tables.. however the data I am looking to join on does not match exactly.. joining on NFL player names..

data sets below..

> dput(att75a)
structure(list(rusher_player_name = c("A.Ekeler", "A.Jones", 
"A.Kamara", "A.Mattison", "A.Peterson", "B.Hill"), mean_epa = c(-0.110459963350783, 
0.0334332018597805, -0.119488111742492, -0.155261835310445, -0.123485646124451, 
-0.0689611296359916), success_rate = c(0.357664233576642, 0.40495867768595, 
0.401129943502825, 0.283018867924528, 0.322727272727273, 0.35
), plays = c(137L, 242L, 177L, 106L, 220L, 80L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))

> dput(rb2019capa)
structure(list(rusher_player_name = c("Aaron Jones", "Adrian Peterson", 
"Alexander Mattison", "Alvin Kamara", "Austin Ekeler", "Brian Hill"
), Team = c("Packers", "Redskins", "Vikings", "Saints", "Chargers", 
"Falcons"), `Salary Cap Value` = c(695487, 1780000, 700545, 1050693, 
646668, 645000), `Cash Spent` = c(645000, 2530000, 1317180, 807500, 
645000, 645000)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

for example I am trying to join A.Mattison on Alexander Mattison.. and so on..

i experimented with stringdist and fuzzyjoin but could not solve my problem..

please consider... took the head() of each dataset to condense per question asking guidelines.. original data sets have lengths of 51 obs. and 168 obs... will that affect how the join is performed?

What is the best way to go about cleaning these names?

thank you for your time..

Would a code that converts `Alexander Mattison` to `A.Mattison` do the job? — Kevin Cazelles, Feb 20 '20 at 23:34
Modify the second df by replacing first name with first initial. As long as there are no duplicates it should be possible to join this with the first df exactly. — Edward, Feb 20 '20 at 23:34
in another post (should not have double posted) I was advised to use sub.. ```sub("(.).*\\.(.*)", "\\1.\\2", rb2019capa$rusher_player_name)```.. the join was successful after matching the names... I am going to experiment answers below when i get back into my environment later... — sbarbarotta, Feb 21 '20 at 15:56

score 0 · Answer 1 · answered Feb 21 '20 at 02:37

Replace the dot with % making an SQL pattern and join based on a match to it.

library(sqldf)

sqldf("select * 
  from att75a a 
  left join rb2019capa r 
    on r.rusher_player_name like replace(a.rusher_player_name, '.', '%')")

giving:

  rusher_player_name    mean_epa success_rate plays rusher_player_name..5
1           A.Ekeler -0.11045996    0.3576642   137         Austin Ekeler
2            A.Jones  0.03343320    0.4049587   242           Aaron Jones
3           A.Kamara -0.11948811    0.4011299   177          Alvin Kamara
4         A.Mattison -0.15526184    0.2830189   106    Alexander Mattison
5         A.Peterson -0.12348565    0.3227273   220       Adrian Peterson
6             B.Hill -0.06896113    0.3500000    80            Brian Hill
      Team Salary Cap Value Cash Spent
1 Chargers           646668     645000
2  Packers           695487     645000
3   Saints          1050693     807500
4  Vikings           700545    1317180
5 Redskins          1780000    2530000
6  Falcons           645000     645000

Edward · Accepted Answer · 2020-02-21T06:35:42.397

Use sub to replace first name with first initial.

library(dplyr)

rb2019capa %>%
  mutate(rusher_player_name=
         sub("^([A-Z])\\S+\\s([A-Za-z].*)$", "\\1.\\2", rusher_player_name)) %>%
  inner_join(att75a, by="rusher_player_name") # or left_join (up to you)

# A tibble: 6 x 7
  rusher_player_name Team     `Salary Cap Value` `Cash Spent` mean_epa success_rate plays
  <chr>              <chr>                 <dbl>        <dbl>    <dbl>        <dbl> <int>
1 A.Jones            Packers              695487       645000   0.0334        0.405   242
2 A.Peterson         Redskins            1780000      2530000  -0.123         0.323   220
3 A.Mattison         Vikings              700545      1317180  -0.155         0.283   106
4 A.Kamara           Saints              1050693       807500  -0.119         0.401   177
5 A.Ekeler           Chargers             646668       645000  -0.110         0.358   137
6 B.Hill             Falcons              645000       645000  -0.0690        0.35     80

joining on inexact strings in R

2 Answers2

Linked