I've got a database of property owners; I would like to count the number of properties owned by each person, but am running into the usual string-mismatch problem, where the same owner appears under several variant spellings:

REDEVELOPMENT AUTHORITY vs. REDEVELOPMENT AUTHORITY O vs. PHILADELPHIA REDEVELOPMEN vs. PHILA. REDEVELOPMENT AUTH

COMMONWEALTH OF PENNA vs. COMMONWEALTH OF PENNSYLVA vs. COMMONWEALTH OF PA

TRS UNIV OF PENN vs. TRUSTEES OF THE UNIVERSIT

From what I've seen, this is a pretty common problem, but my problem differs from those with solutions I've seen for two reasons:

1) I've got a large number of strings (~570,000), so computing the 570,000 x 570,000 matrix of edit distances (or other pairwise match metrics) seems like a daunting use of resources.

2) I'm not doing one-off comparisons (e.g., matching a single user's input against a database on file, which is the most common scenario in the big-data fuzzy-matching questions I've seen). I have one fixed data set that I want to condense once and for all.

Are there any well-established routines for such an exercise? I'm most familiar with Python and R, so an approach in either of those would be ideal, but since I only need to do this once, I'm open to branching out to other, less familiar languages (perhaps something in SQL?) for this particular task.

MichaelChirico
    My first thoughts were to use the qdap package to find the most frequent terms, then do a series of gsubs (or for more power use the gsubfn package) and convert common variations to a standardized form. Second, delete all prepositions and the's, etc. Third, shorten all names to the first two or perhaps three words. At that point try agrep and see if the matches can be done reasonably efficiently. – lawyeR Dec 09 '14 at 16:44
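A rough base-R version of that pipeline (table() and gsub() standing in for qdap/gsubfn; `owners` here is a made-up sample vector, not the real column):

    # toy vector standing in for the ~570,000 owner names
    owners <- c("PHILA. REDEVELOPMENT AUTH", "REDEVELOPMENT AUTHORITY O",
                "TRS UNIV OF PENN", "TRUSTEES OF THE UNIVERSIT")

    # most frequent terms, to see which variants are worth a substitution rule
    head(sort(table(unlist(strsplit(owners, "\\s+"))), decreasing = TRUE), 20)

    # drop punctuation, prepositions and articles, then keep the first two words
    std <- gsub("[[:punct:]]", "", owners)
    std <- trimws(gsub("\\s+", " ", gsub("\\b(OF|THE|AND)\\b", "", std)))
    short <- vapply(strsplit(std, " "),
                    function(w) paste(head(w, 2), collapse = " "), character(1))
    # `short` can then be compared with agrep() to see if matching is fast enough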

2 Answers


That is exactly what I am facing daily at my new job (though my line counts are in the millions). My approach is to:

1) Find the set of unique strings with p <- unique(a).

2) Remove punctuation, split the strings in p on whitespace, and build a table of word frequencies; then write a set of rules and use gsub to "recover" abbreviations, mistyped words, etc. In your case, for example, "AUTH" should be recovered back to "AUTHORITY" and "UNIV" to "UNIVERSITY" (or vice versa).

3) Fix typos when I spot them by eye.

4) Advanced: reorder the words within each string (word order is often sloppy) to check whether two or more strings are identical apart from word order (e.g. "10pack 10oz" vs. "10oz 10pack"); a quick sketch of steps 1), 2) and 4) follows.
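Assuming the owner names live in a character vector a (toy data below), and with the two gsub rules only as illustrations rather than a complete rule set:

    # raw owner column (toy sample)
    a <- c("TRS UNIV OF PENN", "TRUSTEES OF THE UNIVERSIT",
           "COMMONWEALTH OF PENNA", "COMMONWEALTH OF PA")

    # 1) work on the unique strings only
    p <- unique(a)

    # 2) strip punctuation and tabulate word frequencies to find candidate rules
    p <- gsub("[[:punct:]]", "", p)
    word_freq <- sort(table(unlist(strsplit(p, "\\s+"))), decreasing = TRUE)

    #    ...then apply hand-written recovery rules
    p <- gsub("\\bUNIV\\b", "UNIVERSITY", p)
    p <- gsub("\\bTRS\\b", "TRUSTEES", p)

    # 4) canonicalize word order so reordered duplicates collapse together
    canon <- vapply(strsplit(p, "\\s+"),
                    function(w) paste(sort(w), collapse = " "), character(1))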

Alexey Ferapontov
  • I wish there were a better way, but I think this is the most effective. Even better would be if people had some damn consistency when creating string data, but _c'est la vie_. – MichaelChirico Jul 10 '15 at 18:25

You can also use agrep() in R for fuzzy name matching; its max.distance argument sets the number (or fraction) of mismatches allowed. Given a fixed set of canonical names, you can then grep for approximate matches of each one in your database.
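For instance (a small sketch; `owners` is a made-up sample vector, and max.distance = 0.2 allows roughly 20% of the pattern's characters to mismatch):

    owners <- c("REDEVELOPMENT AUTHORITY", "REDEVELOPMENT AUTHORITY O",
                "PHILA. REDEVELOPMENT AUTH", "COMMONWEALTH OF PA")

    # indices of elements that approximately contain the pattern
    hits <- agrep("REDEVELOPMENT AUTHORITY", owners,
                  max.distance = 0.2, ignore.case = TRUE)
    owners[hits]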

darwin