Distance/Fuzzy matching 2 columns with another 2 columns in R

Question

In my simplified example I have a dataframe with four columns. I want to match main_name and main_dob, taken together, against secondary_name and secondary_dob. The order of the rows doesn't matter: if an entry in row 3 of the main columns corresponds to an entry in row 4 of the secondary columns, I want the result to show that the two match.

Below is my sample data.

main_name <- c("Arthur Lee", "Robert Frost", "Sarah Doe", "Elizabeth Smith")
main_dob <- c("3/3/93", "10/21/70", "11/25/88", "4/2/92")

secondary_name <- c("David Lee", "Robert L. Frost", "Elizabeth Smith", "Mark Roger")
secondary_dob <- c("4/4/95", "10/21/70", "4/2/92", "11/25/88")

df <- data.frame(main_name, main_dob, secondary_name, secondary_dob, stringsAsFactors = FALSE)

I would like the output to show that Arthur Lee's closest match is David Lee, along with the distance between the two names and the distance between their birthdays. Next, I would want to see that a match for Robert Frost exists: the name distance is slightly off because secondary_name contains his middle initial, but the birthday helps me verify that it is the same person. There is no Sarah Doe in the secondary columns, so for her I would show whichever entry has the closest name distance, along with the birthday distance. Lastly, Elizabeth Smith should match Elizabeth Smith even though the two entries are on different rows.

I am thinking of using the Jaro-Winkler (jw) distance for the names, but am open to any ideas and help.
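As a rough starting point, the matching described above could be sketched in base R as follows. This is only a sketch under stated assumptions: it uses adist() (Levenshtein edit distance) rather than Jaro-Winkler so that it runs without extra packages, it assumes the dates are month/day/two-digit-year, and the names `best`, `result`, `match_name`, `name_dist` and `dob_dist_days` are made up for illustration. If the stringdist package is installed, `stringdist::stringdistmatrix(main_name, secondary_name, method = "jw")` should be usable in place of the adist() call.

```r
# Sample data from the question
main_name <- c("Arthur Lee", "Robert Frost", "Sarah Doe", "Elizabeth Smith")
main_dob <- c("3/3/93", "10/21/70", "11/25/88", "4/2/92")
secondary_name <- c("David Lee", "Robert L. Frost", "Elizabeth Smith", "Mark Roger")
secondary_dob <- c("4/4/95", "10/21/70", "4/2/92", "11/25/88")

# Pairwise edit distances: rows are main names, columns are secondary names.
# adist() is base R (utils) and computes generalized Levenshtein distance.
name_dist <- adist(main_name, secondary_name)

# For each main name, the column index of the closest secondary name
best <- apply(name_dist, 1, which.min)

# Parse the dates (assumed format: month/day/two-digit year)
main_date <- as.Date(main_dob, format = "%m/%d/%y")
secondary_date <- as.Date(secondary_dob, format = "%m/%d/%y")

# One row per main entry: closest secondary name, plus both distances
result <- data.frame(
  main_name,
  main_dob,
  match_name = secondary_name[best],
  match_dob = secondary_dob[best],
  name_dist = name_dist[cbind(seq_along(best), best)],
  dob_dist_days = abs(as.numeric(main_date - secondary_date[best])),
  stringsAsFactors = FALSE
)
print(result)
```

Because every main name is compared against every secondary name, row order does not matter. Note that this always returns a closest match, even for Sarah Doe, so the two distance columns are what tell a real match from a forced one.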

I think you may have to create a tolerance of some sort. Generally there is forced matching, which means that the algorithm will match to the one with the smallest distance unless you specify that if the distance is greater than a certain amount, it should not be matched. That's just my two cents. — akash87, Jan 14 '20 at 20:47
@akash87 thanks for the reply! I agree that I will need to add a tolerance, but do you have any idea of how I could start the distance matching process? — Joey, Jan 14 '20 at 20:48
I believe there is a package called "fuzzyjoin" that could help get you started. — akash87, Jan 14 '20 at 20:52
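The tolerance idea from the comments could look roughly like the sketch below. It is an illustration only: the cutoff of 0.3 is an arbitrary, hypothetical value, the distance is base R's adist() scaled by the longer string's length (not Jaro-Winkler), and the names `tolerance`, `norm_dist` and `matched` are invented here. In fuzzyjoin, the `max_dist` argument of `stringdist_join()` and its wrappers appears to play the same role as the cutoff.

```r
main_name <- c("Arthur Lee", "Robert Frost", "Sarah Doe", "Elizabeth Smith")
secondary_name <- c("David Lee", "Robert L. Frost", "Elizabeth Smith", "Mark Roger")

# Edit distance scaled by the longer of the two strings, so values lie in [0, 1]
raw_dist <- adist(main_name, secondary_name)
max_len <- outer(nchar(main_name), nchar(secondary_name), pmax)
norm_dist <- raw_dist / max_len

# Hypothetical cutoff: anything farther than this is treated as "no match"
tolerance <- 0.3

# Closest candidate per main name, and how far away it is
best <- apply(norm_dist, 1, which.min)
best_dist <- norm_dist[cbind(seq_along(best), best)]

# Keep the closest candidate only when it is within the tolerance
matched <- ifelse(best_dist <= tolerance, secondary_name[best], NA_character_)

print(data.frame(main_name, matched, best_dist, stringsAsFactors = FALSE))
```

With this cutoff, Robert Frost and Elizabeth Smith keep their matches while Arthur Lee and Sarah Doe come back as NA instead of being forced onto the nearest name. A birthday check could be added to the same condition in the same way.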
