Comparing two columns of two dataframes based on partial string match

Question

I have two sample data frames, df1 and df2 as given below. df1 has the list of selected tennis match fixtures with player names(player1_name,player_name2) and the date they were played. Full names are used here for players.

df2 has the list of all tennis match results(winner, loser) for each date. Here, the first letter of first names and full last names are used. The player names for fixtures and for results were scraped from different websites. So there could be some cases where last names may not exactly match. Taking this into consideration, I would like to add a new column to df1 that says if player1 or player2 won. Basically, I would want to map player1_name and player2_name from df1 to winner and loser from df2 by some means of partial matching given the same date.

dput(df1)
structure(list(date = structure(c(18534, 18534, 18534, 18534, 
18534, 18534, 18534), class = "Date"), player1_name = c("Laslo Djere", 
"Hugo Dellien", "Quentin Halys", "Steve Johnson", "Henri Laaksonen", 
"Thiago Monteiro", "Andrej Martin"), player2_name = c("Kevin Anderson", 
"Ricardas Berankis", "Marcos Giron", "Roberto Carballes", "Pablo Cuevas", 
"Nikoloz Basilashvili", "Joao Sousa")), row.names = c(NA, -7L
), class = "data.frame")

dput(df2)
structure(list(date = structure(c(18534, 18534, 18534, 18534, 
18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534, 
18534, 18534, 18534, 18534, 18534, 18534, 18534), class = "Date"), 
    winner = c("L Harris", "M Berrettini", "M Polmans", "C Garin", 
    "A Davidovich Fokina", "D Lajovic", "K Anderson", "R Berankis", 
    "M Giron", "A Rublev", "N Djokovic", "R Carballes Baena", 
    "A Balazs", "P Cuevas", "T Monteiro", "S Tsitsipas", "D Shapovalov", 
    "G Dimitrov", "R Bautista Agut", "A Martin"), loser = c("A Popyrin", 
    "V Pospisil", "U Humbert", "P Kohlschreiber", "H Mayot", 
    "G Mager", "L Djere", "H Dellien", "Q Halys", "S Querrey", 
    "M Ymer", "S Johnson", "Y Uchiyama", "H Laaksonen", "N Basilashvili", 
    "J Munar", "G Simon", "G Barrere", "R Gasquet", "J Sousa"
    )), row.names = c(NA, -20L), class = "data.frame")

I have created a custom function that can match a string to it’s closest match from a string vector using RecordLinkage package. I could possibly write a super inefficient code using this function but before going there, I’d want to see if I can do it in a more efficient manner.

ClosestMatch <- function(string, stringVector,max_threshold=0.5) {
        df<- character()
        for (i in 1:length(string)) {
                distance <- levenshteinSim(string[i], stringVector)
                if (max(distance)>=max_threshold) {
                        df[i]<- stringVector[which.max(distance)]
                }
                else {
                        df[i]= NA
                }
        }  
        return(df)
}

NovaEthos · Accepted Answer · 2020-10-01T20:06:18.207

I gave it a go using stringdist:

library(stringdist)

for (i in 1:nrow(df1)) {
  
  #this first part combines the names of player1 and player2
  #and finds the closest match to the player combinations in df2

  d <-
    stringdist(
      paste(df1$player1_name[i], df1$player2_name[i]),
      paste(df2$winner, df2$loser),
            method = "cosine")
  #I like using the cosine method as it returns a decimal as opposed to an integer


  #then, added winner and loser columns to df1 based on which row in df2 had the closest match
  #(i.e. lowest stringdist)
 
  df1$winner[i] <- df2[which(d == min(d)), 2]
  df1$loser[i] <- df2[which(d == min(d)), 3]
}

#adding another loop that makes the names in the winner/loser columns
#change to their closest match in the player1 and player2 columns

for(i in 1:nrow(df1)){
  n <- stringdist(df1$winner[i], c(df1$player1_name[i], df1$player2_name[i]), method = "cosine")
  if (n[1] > n[2]){df1$winner[i] <- df1$player2_name[i]
                   df1$loser[i] <- df1$player1_name[i]}
  if (n[1] < n[2]){df1$winner[i] <- df1$player1_name[i]
                   df1$loser[i] <- df1$player2_name[i]}
}

> df1
        date    player1_name         player2_name            winner                loser
1 2020-09-29     Laslo Djere       Kevin Anderson    Kevin Anderson          Laslo Djere
2 2020-09-29    Hugo Dellien    Ricardas Berankis Ricardas Berankis         Hugo Dellien
3 2020-09-29   Quentin Halys         Marcos Giron      Marcos Giron        Quentin Halys
4 2020-09-29   Steve Johnson    Roberto Carballes Roberto Carballes        Steve Johnson
5 2020-09-29 Henri Laaksonen         Pablo Cuevas      Pablo Cuevas      Henri Laaksonen
6 2020-09-29 Thiago Monteiro Nikoloz Basilashvili   Thiago Monteiro Nikoloz Basilashvili
7 2020-09-29   Andrej Martin           Joao Sousa     Andrej Martin           Joao Sousa

And not that anyone asked, but I think Rublev is going to reach at least the semifinals of RG, he's been on fire lately. — NovaEthos, Oct 01 '20 at 19:31
Thanks! This is great. I think you have not taken date into account but that's trivial. I can easily incorporate that. I'm also thinking to return the values from df1 for winner and loser columns so that I can see their full names. It could be a little extra effort but I think I can figure it out. — Mohamad Sahil, Oct 01 '20 at 19:45
Rublev barely escaped defeat against Querrey in the first round but I do agree that he will go deep into the tournament. — Mohamad Sahil, Oct 01 '20 at 19:48
@MohamadSahil Sorry, missed that in the OP. I added another loop that makes the players' name full by determining which name of the two is closest to the selected winner/loser, then replacing that. It's a little clunky so there's probably a better way to do it. And yeah I think Querrey seems to play to the level of whoever he's playing, he's in a lot of five-setters. — NovaEthos, Oct 01 '20 at 20:09

Comparing two columns of two dataframes based on partial string match

1 Answers1