5

I have two lists of names for the same set of students, collected separately. There are numerous typographical errors, and I have been using fuzzy matching to link the two lists. I am 99+% there with agrep and similar, but am stuck on the following basic problem: how can I match (for example) the forenames "Adrian Bruce" and "Bruce Adrian"? The Levenshtein edit distance is no good for this particular case, as it counts character-level edits, so swapping the word order produces a large distance.

This must be a very common problem, but I cannot find any standard R package or routine for addressing it. I presume I am missing something obvious...???
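
A quick illustration with base R (adist computes the generalized Levenshtein distance):

# The two names contain exactly the same words, but the character-level edit distance is large
adist("Adrian Bruce", "Bruce Adrian")   # well above 0
agrep("Adrian Bruce", "Bruce Adrian")   # integer(0): no match at agrep's default tolerance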

smci
Jonathan Burley
  • As @Richie Cotton pointed out, how do you handle 3+ names, or optional hyphenation in the last name? You could split on both ' ' and '-'. Seems to me you can set a canonical ordering by just reordering the name-tuples alphabetically: `paste(sort(c('Smith', 'John')), collapse = ' ')` gives `'John Smith'` – smci May 17 '14 at 20:55
  • I edited your title to specify order-independence with *"Firstname Lastname"/"Lastname Firstname"*. Please reedit if you need more generality. – smci May 17 '14 at 21:01

2 Answers

3

Well, one fairly easy way is to swap the words and match again...

y=c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")
y2 <- sub("(.*) (.*)", "\\2 \\1", y)

agrep("Bruce Lee", y)  # No match
agrep("Bruce Lee", y2) # Match!
Tommy
  • `sub` – another new command, to me at least. Splendid, thanks Tommy. – Jonathan Burley Feb 02 '12 at 20:24
  • 1
    @JonathanBurley: Watch out for non standard names. You should test your code against `c("Lulu", "Ho Chi Minh", "Hugh Fearnley-Whittingstall", NA)`. – Richie Cotton Feb 03 '12 at 11:19
  • @JonathanBurley: `grep`, `grepl`, `regexpr`, `gregexpr`, `regexec`, `sub` and `gsub` all share the same underlying regex machinery (see also `match`/`pmatch`/`charmatch` for exact and partial matching). Gotta love the R language! Feels like PHP for a new generation! – smci May 17 '14 at 20:59
0

The technique I usually use is fairly robust and relatively insensitive to ordering, punctuation, etc. It is based on "n-grams": the overlapping substrings of n consecutive characters. With n = 2 they are called "bigrams". For instance:

"Adrian Bruce" --> ("Ad","dr","ri","ia","an","n "," B","Br","ru","uc","ce")
"Bruce Adrian" --> ("Br","ru","uc","ce","e "," A","Ad","dr","ri","ia","an")

Each string has 11 bigrams, and 9 of them are in common, so the similarity score is high: 9/11 ≈ 0.818, where 1.000 is a perfect match.

I am not very familiar with R, but if a package does not exist, this technique is very easy to code: loop through the bigrams of string 1 and tally how many are also contained in string 2.
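
A minimal base R sketch of that idea (the function names bigrams and bigram_similarity are invented here for illustration):

# Split a string into its overlapping 2-character substrings (assumes nchar(s) >= 2)
bigrams <- function(s) {
  n <- nchar(s)
  substring(s, 1:(n - 1), 2:n)
}

# Proportion of string a's bigrams that also occur somewhere in string b
bigram_similarity <- function(a, b) {
  ba <- bigrams(a)
  bb <- bigrams(b)
  sum(ba %in% bb) / length(ba)
}

bigram_similarity("Adrian Bruce", "Bruce Adrian")  # 9/11 = 0.8181818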

Mattia