
I have a spreadsheet with fields such as address, name, IBAN and e-mail, and I want to identify when a customer last bought something.

The problem is that some fields contain spelling mistakes, while others were deliberately entered incorrectly.

On GitHub, several libraries such as https://github.com/seatgeek/fuzzywuzzy, https://github.com/seamusabshere/fuzzy_match and https://github.com/atom/fuzzaldrin can perform fuzzy searches on a single comparable column. But I want to combine multiple fields - this sounds like a common problem, and I expected to find existing solutions for it.

Can you recommend approaches for such a problem? Are there existing projects for such a problem which I am missing? Is a regular string-distance over all the fields usually good enough?

Georg Heiler

2 Answers


I mentioned it in your other question, but the dedupe Python library does what you want.

Basically, it calculates the distance between each field in a pair of rows, then learns optimal weights to combine those distances into a single record-pair score.
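As a rough illustration of that idea, here is a minimal sketch using only the standard library. It is not dedupe's API: the field names, the sample records and the weights are made up for the example, and the weights are set by hand, whereas dedupe learns them from labelled pairs.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-picked weights per field (assumption for this sketch;
# dedupe would learn such weights from labelled record pairs).
WEIGHTS = {"name": 0.3, "address": 0.2, "iban": 0.3, "email": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(rec_a: dict, rec_b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-field similarities of two records."""
    total = sum(weights.values())
    score = sum(
        w * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, w in weights.items()
    )
    return score / total


# Made-up sample rows: b is a with typos, c is a different customer.
a = {"name": "Anna Schmidt", "address": "Hauptstrasse 12, Wien",
     "iban": "AT611904300234573201", "email": "anna.schmidt@example.com"}
b = {"name": "Ana Schmitt", "address": "Hauptstr. 12, Wien",
     "iban": "AT611904300234573201", "email": "anna.schmidt@exmaple.com"}
c = {"name": "Peter Huber", "address": "Ringweg 3, Graz",
     "iban": "AT483200000012345864", "email": "p.huber@example.org"}

print(record_similarity(a, b))  # near-duplicate pair, should score high
print(record_similarity(a, c))  # unrelated pair, should score lower
```

Pairs scoring above some threshold would then be treated as the same customer. Choosing that threshold and the weights by hand is the fragile part, which is why learning them from examples is attractive.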

fgregg

So far, I believe the approach described in http://blog.yhat.com/posts/fuzzy-matching-with-yhat.html, which uses fuzzywuzzy, is the best one.

Georg Heiler