Match two datasets with record linkage in R

Question

I am trying to match two datasets in R: datasetA and datasetB. These datasets contain the following columns.

datasetA

ID: 15
Name: peter sanders
First_Name: peter
Last_Name: sanders
ORG_NAME:coffee&cake
City: New York
Amount(USD): 10369
Category: food & beverages
Date: 12/01/2014

datasetB has similar columns:

ORG_ID:5241
names: peter sander
first name: peter
last name: sander
company_name: coffee and cakes
location: New York
funded: 10000
sub_cat: restaurants
start_date: 2013-01-09 16:42:56
end_date: 2015-01-04 11:43:39

The only exact match is the first name 'peter'. But my datasets contain many companies, so there will be many people named 'peter' who are not the same person. Therefore, I want to match on similarity across multiple columns.

I want to match these two datasets based on the information in all columns. I think I need Levenshtein similarity and compare.linkage (from the RecordLinkage package) for this, but I have not managed to get it working.

Does anyone know how I can match this? Any help would be greatly appreciated.

First, you need to share a reproducible example. Second, if there is a common primary key between both datasets, you wouldn't need a similarity approach at all; some data cleaning, e.g. using a regular expression to convert & to and, may be enough to make it work. — Rana Usman, Mar 15 '18 at 13:46
Thank you Rana, I adjusted the question. Unfortunately, there is no common Pkey that I can use to match the datasets. — Alice M, Mar 15 '18 at 14:11

score 0 · Answer 1 · answered Mar 15 '18 at 21:00

Since the data is not available, there's not much input I can give, but this should get you started.

I created a small dataset based on your question.

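# Record modelled on datasetA.
# stringsAsFactors = TRUE keeps the pre-R-4.0 default that the result below relies on.
df <- data.frame(name = "Peter Sanders", firstname = "peter", lastname = "sanders",
                 org = "coffee&cake", stringsAsFactors = TRUE)

# Record modelled on datasetB, with a deliberately garbled name.
df1 <- data.frame(name = "Peter Sandesadasdasdasr", firstname = "peter", lastname = "sander",
                  cname = "coffee and cake", stringsAsFactors = TRUE)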

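I used R's built-in dist() function with the Manhattan distance. Note that dist() works on numbers, so what it compares here are the integer factor codes of the values (data.frame() converted strings to factors by default before R 4.0), not the text of the strings.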

dist(cbind(unlist(df1), unlist(df)), "manhattan")

Result

          name firstname lastname
firstname    2                   
lastname     4         2         
cname        6         4        2
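
If you want to compare the text itself, one possible sketch uses base R's adist(), which computes Levenshtein edit distances between strings. The toy records, the column names, the lev_sim helper and the equal weighting of columns are all assumptions made for illustration; they are not taken from your real data, and you would probably want different weights or a blocking step on larger datasets.

```r
# Toy records modelled on the question; column names are illustrative only.
a <- data.frame(name = "peter sanders", org = "coffee&cake",
                city = "New York", stringsAsFactors = FALSE)
b <- data.frame(name = c("peter sander", "peter miller"),
                org  = c("coffee and cakes", "tea house"),
                city = c("New York", "Boston"),
                stringsAsFactors = FALSE)

# Hypothetical helper: Levenshtein similarity scaled to [0, 1],
# where 1 means the two strings are identical (ignoring case).
lev_sim <- function(x, y) {
  x <- tolower(x)
  y <- tolower(y)
  d <- adist(x, y)                            # rows index x, columns index y
  max_len <- outer(nchar(x), nchar(y), pmax)  # length of the longer string per pair
  1 - d / pmax(max_len, 1)                    # guard against division by zero
}

# Average the per-column similarities for every (row of a, row of b) pair.
sims  <- lapply(names(a), function(col) lev_sim(a[[col]], b[[col]]))
score <- Reduce(`+`, sims) / length(sims)

# For each record in a, keep the best-scoring record in b.
best    <- apply(score, 1, which.max)
matches <- data.frame(a_row = seq_len(nrow(a)),
                      b_row = best,
                      score = score[cbind(seq_len(nrow(a)), best)])
print(matches)
```

In practice you would also set a minimum score below which a pair is treated as "no match", since which.max always returns some row even when nothing is similar.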
