-1

I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.

For example, I have the following two data files:

File name: Employee Data

Full Name Date Started Orders ANGELA MUIR 6/15/14 25 EILEEN COWIE 6/15/14 44 LAURA CUMMING 10/6/14 43 ELENA POPA 1/21/15 37 KAREN MACEWAN 3/15/99 39

File name: Assessment data

Candidate Leading Factor SI-D SI-I Angie muir I -3 12 Caroline Burn S -5 -3 Eileen Mary Cowie S -5 5 Elena Pope C -4 7 Henry LeFeuvre C -5 -1 Jennifer Ford S -3 -2 Karen McEwan I -4 10 Laura Cumming S 0 6 Mandip Johal C -2 2 Mubarak Hussain D 6 -1

I want to merge them based on names (Full Name in df1 and Candidate in df2) ignoring middle name (eg Eilen Cowie = Eileen Mary Cowie), extra spaces (Laura Cumming = Laura Cumming); misspells (e.g. Elena Popa = Elena Pope) etc.

The ideal output would look like this:

Name Full Name Candidate Date Started Orders Leading Factor SI-D SI-I ANGELA MUIR ANGELA MUIR Angie muir 6/15/14 25 I -3 12 Caroline Burn N/A Caroline Burn N/A N/A S -5 -3 EILEEN COWIE EILEEN COWIE Eileen Mary Cowie 6/15/14 44 S -5 5 ELENA POPA ELENA POPA Elena Pope 1/21/15 37 C -4 7 Henry LeFeuvre N/A Henry LeFeuvre N/A N/A C -5 -1 Jennifer Ford N/A Jennifer Ford N/A N/A S -3 -2 KAREN MACEWAN KAREN MACEWAN Karen McEwan 3/15/99 39 I -4 10 LAURA CUMMING LAURA CUMMING Laura Cumming 10/6/14 43 S 0 6 Mandip Johal N/A Mandip Johal N/A N/A C -2 2 Mubarak Hussain N/A Mubarak Hussain N/A N/A D 6 -1

Any suggestions would be greatly appreciated!

  • yes that would be ideal, but there is no guarantee with fuzzy text merging because of problematic false positive or false negative results. If the scale of the data isn't that large (less than 500 rows), than you can do it and visually inspect the results. – Pierre L Jun 01 '15 at 23:03
  • 1
    repost your data in R code format, so users can help you properly. – Pierre L Jun 01 '15 at 23:06

1 Answers1

0

Here's a process that may help. You will have to inspect the results and make adjustments as needed.

df1

#            v1      v2
#1  ANGELA MUIR 6/15/14
#2 EILEEN COWIE 6/15/14
#3 AnGela Smith  5/3/14

df2

#                 u1   u2
#1 Eileen Mary Cowie  I-3
#2        Angie muir -5 5

index <- sapply(df1$v1, function(x) {
  agrep(x, df2$u1, ignore.case=TRUE, max.distance = .5)
}
)
index <- unlist(index)
df2$u1[index] <- names(index)
merge(df1, df2, by.x='v1', by.y='u1')

#            v1      v2   u2
#1  ANGELA MUIR 6/15/14 -5 5
#2 EILEEN COWIE 6/15/14  I-3

I had to adjust the argument max.distance in the index function. It may not work for your data, but adjust and test if it works. If this doesn't help, there is a package called stringdist that may have a more robust matching function in amatch.

Data

v1 <- c('ANGELA MUIR', 'EILEEN COWIE', 'AnGela Smith')
v2 <- c('6/15/14', '6/15/14', '5/3/14')
u1 <- c('Eileen Mary Cowie', 'Angie muir')
u2 <- c('I-3', '-5 5')
df1 <- data.frame(v1, v2, stringsAsFactors=F)
df2 <- data.frame(u1, u2, stringsAsFactors = F)
Community
  • 1
  • 1
Pierre L
  • 28,203
  • 6
  • 47
  • 69