
I am trying to match two datasets in R: datasetA and datasetB. These datasets contain the following columns.

datasetA

  • ID: 15
  • Name: peter sanders
  • First_Name: peter
  • Last_Name: sanders
  • ORG_NAME: coffee&cake
  • City: New York
  • Amount(USD): 10369
  • Category: food & beverages
  • Date: 12/01/2014

datasetB has similar columns:

  • ORG_ID: 5241
  • names: peter sander
  • first name: peter
  • last name: sander
  • company_name: coffee and cakes
  • location: New York
  • funded: 10000
  • sub_cat: restaurants
  • start_date: 2013-01-09 16:42:56
  • end_date: 2015-01-04 11:43:39

The only exact match is the first name 'peter'. But my datasets contain many companies, so there will be many Peters who are not the same person. Therefore, I want to match on similarity across multiple columns.

I want to match these two datasets based on the information in all columns. I think I need Levenshtein similarity and compare.linkage for this, but I have not managed to get it working.

Does anyone know how I can match this? Any help would be greatly appreciated.

Alice M
  • First, you need to share a reproducible example. Second, if there is a common primary key between the two datasets, you wouldn't need a similarity approach at all; some data cleaning, e.g. using a regular expression to convert "&" to "and", may be enough to make it work. – Rana Usman Mar 15 '18 at 13:46
  • Thank you Rana, I adjusted the question. Unfortunately, there is no common Pkey that I can use to match the datasets. – Alice M Mar 15 '18 at 14:11
  • A reproducible example would help a lot – s_baldur Mar 15 '18 at 14:13

1 Answer


Since the data is not available, there's not much input I can give, but this should get you started.

I created a small dataset based on your question.

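# Toy records based on the question. stringsAsFactors = TRUE reproduces the
# data.frame() default from before R 4.0, which the result below relies on.
df <- data.frame(name = "Peter Sanders", firstname = "peter", lastname = "sanders",
                 org = "coffee&cake", stringsAsFactors = TRUE)

df1 <- data.frame(name = "Peter Sandesadasdasdasr", firstname = "peter", lastname = "sander",
                  cname = "coffee and cake", stringsAsFactors = TRUE)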

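I used R's built-in dist() function with the Manhattan distance. Note that dist() only works on numbers: when the columns are factors (the data.frame() default before R 4.0), it compares their integer codes, so the result is a distance between fields rather than a measure of how similar the strings are.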

dist(cbind(unlist(df1), unlist(df)), "manhattan")

Result

          name firstname lastname
firstname    2                   
lastname     4         2         
cname        6         4        2
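
Since the question asks about Levenshtein similarity, here is a minimal sketch of that idea using base R's adist(), which computes the edit distance between strings. The helper names normalise() and lev_sim(), the toy rows, the choice of columns, and the equal weighting are all my own assumptions for illustration; you would need to adapt them to your real column names and tune the weights.

```r
# Toy records modelled on the question (rows and column names are illustrative)
a <- data.frame(name = "peter sanders", org = "coffee&cake", city = "New York",
                stringsAsFactors = FALSE)
b <- data.frame(name = c("peter sander", "peter smith"),
                org  = c("coffee and cakes", "smith hardware"),
                city = c("New York", "Boston"),
                stringsAsFactors = FALSE)

# Light cleaning before comparing: lower-case, spell out "&", squeeze whitespace
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("&", " and ", x, fixed = TRUE)
  trimws(gsub("\\s+", " ", x))
}

# Levenshtein similarity in [0, 1]:
# 1 - (edit distance / length of the longer string)
lev_sim <- function(x, y) {
  x <- normalise(x)
  y <- normalise(y)
  d <- adist(x, y)                            # rows: x, columns: y
  longest <- outer(nchar(x), nchar(y), pmax)  # longer length for each pair
  1 - d / pmax(longest, 1)                    # guard against empty strings
}

# Average the per-column similarities (equal weights are an arbitrary choice)
cols  <- c("name", "org", "city")
sims  <- lapply(cols, function(col) lev_sim(a[[col]], b[[col]]))
score <- Reduce(`+`, sims) / length(cols)

# For each row of a, the index of the best-scoring candidate row in b
best <- apply(score, 1, which.max)
```

Each entry of score compares one row of a with one row of b, so in practice you would probably keep only the pairs above some threshold instead of always taking the maximum. Comparing every row with every other row gets expensive on large datasets; restricting the comparison to candidates that share an exact value, such as the city, is one way to keep it manageable.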
Rana Usman