
I am looking for some dedupe software that is compatible with MS SQL Server. I have a rather extensive and messy table that contains addresses from all over the world, in many different languages. The table is set up to handle dupes as parent/child records, so some functionality to handle a match is required (i.e. not just deleting a dupe).

Edit: Here's the structure

ParentID | MasterID | PropertyName | Address1 | Address2 | PostalCode | City | StateProvinceCode | CountryCode | PhoneNumber

The MasterID is unique for each record.

ParentID contains the MasterID for the parent record of each entry, and the parent record is where the MasterID = ParentID.

CountryCode is the two letter ISO country code (not telephone code).
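
As a starting point for the exact-match cases, a plain T-SQL pass could re-point dupes at a single parent. This is only a sketch, and the table name dbo.Properties is made up:

    -- Groups rows whose address fields match exactly, picks the lowest
    -- MasterID in each group as the parent, and points every other row's
    -- ParentID at it (so the parent is the row where MasterID = ParentID).
    ;WITH Grouped AS (
        SELECT MasterID,
               MIN(MasterID) OVER (
                   PARTITION BY PropertyName, Address1, Address2,
                                PostalCode, City, StateProvinceCode, CountryCode
               ) AS CanonicalID
        FROM dbo.Properties
    )
    UPDATE p
    SET    p.ParentID = g.CanonicalID
    FROM   dbo.Properties AS p
    JOIN   Grouped AS g ON g.MasterID = p.MasterID
    WHERE  p.ParentID <> g.CanonicalID;

Exact matching obviously won't catch variants like 123 N. Main Street vs. 123 North Main St., which is what the fuzzy side of this question is about.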

  • I would use SQL for this, it's great at de-duping. If you post your table structure and criteria we can probably help you with a query. – JNK Aug 22 '11 at 20:24
  • Are you looking at more than just matching identical addresses, i.e. do you want to consider 123 N. Main Street as a duplication of 123 North Main St.? – hatchet - done with SOverflow Aug 22 '11 at 20:27
  • see http://stackoverflow.com/questions/291728/open-source-address-scrubber – hatchet - done with SOverflow Aug 22 '11 at 20:35
  • @JNK: Updated the question with the table structure. Thanks for any help! – copjon Aug 22 '11 at 20:37
  • @hatchet, yes fuzzy matching as well. Thanks for any help! – copjon Aug 22 '11 at 20:40
  • I work for a company that does this on the scale of hundreds of millions a year, and our dedupe software runs to 3 programs and about 30k lines of code. My best advice would be to ensure that, whatever you do, you start off with a character encoding that can handle all this data. As @JNK says, SQL can be good for this, though C and Python have a lot of built-in flexibility. – Ben Aug 22 '11 at 21:46
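
Following up on Ben's encoding point: in SQL Server that mostly means keeping the text columns Unicode. A minimal sketch, again with the hypothetical dbo.Properties name and illustrative column sizes:

    -- NVARCHAR keeps addresses in any script intact; plain VARCHAR can
    -- mangle non-Latin text. Sizes here are illustrative only.
    ALTER TABLE dbo.Properties ALTER COLUMN PropertyName NVARCHAR(200) NULL;
    ALTER TABLE dbo.Properties ALTER COLUMN Address1     NVARCHAR(200) NULL;
    ALTER TABLE dbo.Properties ALTER COLUMN Address2     NVARCHAR(200) NULL;
    ALTER TABLE dbo.Properties ALTER COLUMN City         NVARCHAR(100) NULL;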

1 Answer


Address duplicates are notoriously difficult to track down. The same address can be written many valid ways (123 North Main St. vs. 123 N. Main Street, for example), which makes simple string matching unreliable.

The fact that you have business rules that allow for duplicates some of the time makes me think you might be better off rolling your own piece of software to find unacceptable dupes and remove them.

In the past I have done this with addresses by running each one through a free geocoding service (Google's Maps API, for instance) and looking for points that fall within a certain threshold of each other (10 feet or so). At that point you can determine whether a pair qualifies as an "unacceptable duplicate" and delete it.
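
If you store the coordinates that come back from the geocoder, SQL Server's geography type can do the distance math for you. A sketch, assuming hypothetical Latitude and Longitude columns on the dbo.Properties table from the question, with the 10-foot threshold expressed as roughly 3.05 meters:

    -- Lists candidate pairs whose geocoded points sit within ~10 feet.
    -- geography::Point takes (latitude, longitude, SRID); 4326 is WGS 84,
    -- and STDistance returns meters for that SRID.
    SELECT a.MasterID AS RecordA,
           b.MasterID AS RecordB
    FROM   dbo.Properties AS a
    JOIN   dbo.Properties AS b
           ON  a.MasterID < b.MasterID        -- each pair once
           AND a.CountryCode = b.CountryCode  -- cheap pre-filter
    WHERE  geography::Point(a.Latitude, a.Longitude, 4326)
               .STDistance(geography::Point(b.Latitude, b.Longitude, 4326)) <= 3.05;

On 200k+ rows the naive self-join gets expensive, so a pre-filter like the CountryCode match above (or PostalCode) matters; a persisted geography column with a spatial index would help further.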

To find the distance between two coordinates I would recommend the great-circle distance; one way to compute it is sketched below.
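
If you'd rather compute it yourself than lean on the geography type, here is a minimal haversine sketch (the function name GreatCircleMeters is made up; inputs are in degrees, the result is in meters):

    -- Haversine great-circle distance between two lat/long points given
    -- in degrees. 6371000 is the mean Earth radius in meters.
    CREATE FUNCTION dbo.GreatCircleMeters
        (@Lat1 FLOAT, @Lon1 FLOAT, @Lat2 FLOAT, @Lon2 FLOAT)
    RETURNS FLOAT
    AS
    BEGIN
        DECLARE @dLat FLOAT = RADIANS(@Lat2 - @Lat1);
        DECLARE @dLon FLOAT = RADIANS(@Lon2 - @Lon1);
        DECLARE @a FLOAT =
            SIN(@dLat / 2) * SIN(@dLat / 2) +
            COS(RADIANS(@Lat1)) * COS(RADIANS(@Lat2)) *
            SIN(@dLon / 2) * SIN(@dLon / 2);
        RETURN 6371000.0 * 2 * ATN2(SQRT(@a), SQRT(1 - @a));
    END;

Good luck!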

Abe Miessler
  • If you do it this way, how do you avoid 123 N. Main St. #100 being considered a dupe of 123 N. Main St. #102 (the office across the hall in the same building that is literally 10 feet away)? – hatchet - done with SOverflow Aug 22 '11 at 20:45
  • I suspect that office numbers in the same building would geocode to the same location. But if not, then you would need to add in logic to look for this type of thing (expand the threshold, or look for the same street number and name, for instance). It's not perfect, but I have yet to find a solution that works better for addresses. – Abe Miessler Aug 22 '11 at 20:54
  • I think this could work, and we've used both Google's and Bing's/MSFT's geocoding APIs to geocode other data of ours before. My concern is that this table is 200k+ records and I don't believe it is free for that many requests. – copjon Aug 22 '11 at 20:59
  • Yeah, I believe the last time I checked it was capped at 30k free a day for Google. You could look at using multiple APIs to geocode? It might take a few days to get everything geocoded, but if that fits your time frame it might work. Or you can bite the bullet and pay for the service... I ran into the same issue and we just ended up running it every day for a week or so. – Abe Miessler Aug 22 '11 at 21:02