
I have two different dataframes in R that I am trying to merge together. One is just a set of names and the other is a set of names with corresponding information about each person.

So say I want to take this first dataframe:

Name
1. Blow, Joe
2. Smith, John
3. Jones, Tom 
etc....

and merge it to this one:

   DonorName  CandidateName DonationAmount CandidateParty
1   blow joe Bush, George W          3,000     Republican
2   guy some  Obama, Barack          5,000       Democrat
3 smith john    Reid, Harry          4,000       Democrat

such that I'd have a new list that includes only people on my first list, with the information from the second. Were the two "Name" values formatted in the same way, I could just use merge(), but would there be a way to somehow use agrep() or pmatch() to do this? (A rough sketch of what I have in mind follows the example data below.)

Also, the second dataframe I'm working with has about 25 million rows and 6 columns, so would writing a for loop be the fastest way to go about this?

Reproducible versions of the example data:

first <- data.frame(Name=c("Blow, Joe","Smith, John","Jones, Tom"),
         stringsAsFactors=FALSE)

second <- read.csv(text="
DonorName|CandidateName|DonationAmount|CandidateParty
blow joe|Bush, George W|3,000|Republican
guy some|Obama, Barack|5,000|Democrat
smith john|Reid, Harry|4,000|Democrat",header=TRUE,sep="|",
stringsAsFactors=FALSE)
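
To be concrete, something like this rough sketch is what I had in mind with agrep() -- completely untested at scale, and I may well be misusing it:

# normalise my names to the donor format, then let agrep() do
# approximate matching (tolerating small typos)
patterns <- tolower(gsub(",", "", first$Name))   # "Blow, Joe" -> "blow joe"
hits <- unique(unlist(lapply(patterns, function(p)
  agrep(p, second$DonorName, max.distance = 0.1))))
second[hits, ]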
  • Are your formats always the same in each dataset, or do they vary within each set? – thelatemail May 27 '14 at 02:20

2 Answers


Solution:

# normalise "Blow, Joe" -> "blow joe" so it matches second$DonorName
first$DonorName <- gsub(", "," ",tolower(first$Name),fixed=TRUE)

require(dplyr)

result <- inner_join(first,second,by="DonorName")

will give you what you need if the data is as you've provided it.

result
         Name  DonorName  CandidateName DonationAmount CandidateParty
1   Blow, Joe   blow joe Bush, George W          3,000     Republican
2 Smith, John smith john    Reid, Harry          4,000       Democrat

"fast way to go about this"

The dplyr method as above:

f_dplyr <- function(left,right){
   left$DonorName <- gsub(", "," ",tolower(left$Name),fixed=TRUE)
   inner_join(left,right,by="DonorName")
}

data.table method, setting key on first.

f_dt <- function(left,right){
   # assumes left and right are already data.tables (see as.data.table below)
   left[,DonorName := gsub(", "," ",tolower(Name),fixed=TRUE)]
   setkey(left,DonorName)
   left[right,nomatch=0L]
}

data.table method, setting both keys.

f_dt2 <- function(left,right){
   left[,DonorName :=  gsub(", "," ",tolower(Name),fixed=TRUE)]
   setkey(left,DonorName)
   setkey(right,DonorName)
   left[right,nomatch=0L]
}

base method relying on sapply:

f_base <- function(){
  second[second$DonorName %in%
  sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]
}

Let's make the second data frame a bit more realistic for a fairer comparison (the rep() below blows it up to 3 million rows):

second <- cbind(second[rep(1:3,1000000),],data.frame(varn= 1:1000000))

library(data.table)
left <- as.data.table(first)
right <- as.data.table(second)

library(microbenchmark)

microbenchmark(
          f_base(),
          f_dplyr(first,second),
          f_dt(left,right),
          f_dt2(left,right),
          times=20)

And we get:

Unit: milliseconds
                   expr       min        lq    median        uq       max neval
               f_base() 2880.6152 3031.0345 3097.3776 3185.7903 3904.4649    20
 f_dplyr(first, second)  292.8271  362.7379  454.6864  533.9147  774.1897    20
      f_dt(left, right)  489.6288  531.4152  605.4148  788.9724 1340.0016    20
     f_dt2(left, right)  472.3126  515.4398  552.8019  659.7249  901.8133    20

On my machine, with this admittedly contrived example, we gain about 2.5 seconds over the base method. sapply simplifies its result and doesn't scale very well in my experience, and this gap likely gets bigger as you increase the number of unique groups in first and second.

Please feel free to edit if you come up with a more efficient approach. I don't pretend to know it all, but I always try to learn something.

  • Nice, but this only covers one aspect of benchmarking - namely a large number of rows. It has a total of 5 unique groups. [Here's a gist](https://gist.github.com/arunsrinivasan/db6e1ce05227f120a2c9) illustrating two other factors that I've come to know of from various benchmarks - the number of unique groups and the number of columns the join is performed on. – Arun May 29 '14 at 00:38
  • @Arun I suspected punches might be traded depending on the data structure. I think the philosophies are _fundamentally_ different, but I'd certainly welcome some friendly competition plus a wealth of fair public comparisons provided as documentation on both repositories' homepages. Dare I say a great time waster would be a hosted `Shiny` app? – npjc May 29 '14 at 16:48
  • Right. It's just very time consuming, especially covering all cases. I've attempted quite a few cases already though. Hopefully we'll get them out on our homepage sometime soon. – Arun May 29 '14 at 17:03
  • @Arun. Well you know us end users, parasitic behaviour built in. That being said, perhaps it would be interesting to just start a shared repo that acts simply as a "results dump". So anyone could submit their gist / pull request (say in the form of a cached .rmd) of a specific comparison on their own machine and then maintainers can validate before merging in. Then you also get to see how reproducible a comparison (experiment) is across not just one user but a community of users. just FFT (_food for thought_). – npjc May 29 '14 at 17:10
  • One big problem with that is, I don't consider benchmarking in 'ms' or 'µs' very useful (for tools developed for large data). I'd like to *stress test* and see where each one breaks. I've been benchmarking on 1/2 - 1 billion rows with varying group sizes. That way I (and others) can see very clearly how well each tool scales or breaks. – Arun May 29 '14 at 17:16

Without dplyr:

second[second$DonorName %in%
  sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]

Result:

#    DonorName  CandidateName DonationAmount CandidateParty
# 1   blow joe Bush, George W          3,000     Republican
# 3 smith john    Reid, Harry          4,000       Democrat