I am currently working on a project and have reached a problem... I am trying to match two data frames based on a candidate's name. I have managed to do this, however with anything more than a max_dist of 2 I start to get duplicate entries. However, these would be easily avoided if I could 'group' the candidates by race (state and district) before running stringdist_join as there are only a few candidates in each race with very little chance of having two candidates with similar names.
The goal is to obtain a table called tmpJoin where I can have both the candidateID and the canVotes, along with the name, state, district.
Any suggestions would be greatly appreciated!
Below is my code as well as a replication of the two datasets
state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <-c('01','02','02','03','01','01','02','01','02','02')
FullName <-c('Sonny Callahan','Tom Bevill','Faye Baggiano','Thomas
Bevill','Don Sledge','William Turner', 'Bill Turner','Ed Smith','Tom
Bevill','Edward Smith')
canVotes <-c('234','589','9234','729','149','245','879','385','8712','7099')
yearHouseResult <- data.frame(state, district, FullName,canVotes)
state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <-c('01','02','02','03','01','01','02','01','02','02')
FullName <-c('Sonny Callahan','Tom Beville','Faye Baggiano','Thom Bevill','Donald Sledge','Bill Turner', 'Bill Turner','Ed Smith','Tom Bevill','Ed Smith')
candidateID <- c('1','2','3','4','5','6','7','8','9','10')
congrCands <- data.frame(state, district, FullName, candidateID)
tmpJoin <- stringdist_join(congrCands, yearHouseResult,
by = "FullName",
max_dist=2,
method = "osa",
ignore_case = FALSE,
distance_col = "matchingDistance")