
What are the options for building a data cleansing process (deduplication/matching) on MS SQL Server 2008 R2? Or better yet, how can I apply weighted scoring to the columns of a row in a matching process? The situation is the following: I have a persons table in my database, with their associated addresses and documents in other tables. How can I make the best match decision based on name, serial number of the document, and address? As I understand it, SSIS Fuzzy Grouping does not support weighted scoring.

MariaMadalina
  • Do you only want to consider solutions native to SQL Server? I work for a company that builds de-duplication and matching solutions which integrate with SQL Server. – Akshay Mar 17 '14 at 11:25
  • You should look into 3rd party tools, or alternatively consider upgrading to SQL Server 2012, which ships with Data Quality Services; there are no native options in SQL Server 2008 R2 other than the SSIS fuzzy components. – Dan Mar 17 '14 at 11:37
  • Thanks for your opinion. I've done thorough research on the internet and I have found these articles: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/ and another one http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/71486/ – MariaMadalina Mar 18 '14 at 15:23

2 Answers


I do not have much experience with SSIS at the moment, so this answer focuses on the de-duping/matching/scoring aspect of your question.

There are many ways to approach a Data Quality strategy such as this, all of which have pros and cons, and a lot of it comes down to your existing data management practices: how clean and standardised is the data you are trying to dedupe?

Even 'simple' items like telephone numbers can be difficult to dedupe if they have not been standardised - for example, all of these are different representations of the same number (a normalisation sketch follows the list):

  • +1 (888) 707-8822
  • 1-888-707-8822
  • 18887078822
  • 001 888 7078822
  • 888-7078822
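
To illustrate, here is a minimal T-SQL sketch of the kind of normalisation needed before matching; the function name is mine, and the leading-'1'/'001' handling is a US-centric assumption:

```sql
-- Minimal sketch: keep only the digits so that all five
-- representations above collapse to the same comparison key.
CREATE FUNCTION dbo.NormalizePhone (@phone VARCHAR(50))
RETURNS VARCHAR(50)
AS
BEGIN
    DECLARE @digits VARCHAR(50) = '';
    DECLARE @i INT = 1;

    -- Copy across digit characters only.
    WHILE @i <= LEN(@phone)
    BEGIN
        IF SUBSTRING(@phone, @i, 1) LIKE '[0-9]'
            SET @digits = @digits + SUBSTRING(@phone, @i, 1);
        SET @i = @i + 1;
    END;

    -- US-centric assumption: strip an international/trunk prefix
    -- so '18887078822' and '001 888 7078822' compare equal.
    IF @digits LIKE '001%' AND LEN(@digits) = 13
        SET @digits = SUBSTRING(@digits, 4, 50);
    IF @digits LIKE '1%' AND LEN(@digits) = 11
        SET @digits = SUBSTRING(@digits, 2, 50);

    RETURN @digits;
END;
```

All five bullet points above then normalise to the same key, 8887078822.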

The more complex structures such as addresses get even more interesting: are 'flat 2' and 'apartment 2' the same thing or different?
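
A crude illustration of that problem in T-SQL (dbo.Addresses and the rule list are purely hypothetical, and real address tools use token-level parsing rather than blind string replacement, which can over-match):

```sql
-- Sketch: map common synonyms onto one canonical token before
-- comparing addresses, so 'flat 2' and 'apartment 2' become equal.
-- NOTE: naive REPLACE can mangle words containing 'apt' such as
-- 'captain'; this only illustrates the idea of a synonym pass.
SELECT AddressId,
       REPLACE(REPLACE(LOWER(AddressLine1),
               'apartment', 'flat'),
               'apt', 'flat') AS NormalisedLine1
FROM dbo.Addresses;
```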

You have two choices: make it yourself or trust a third party.

Make it yourself

  • Advantages
    • Lots of fun logical problems to work through
    • Will be able to tweak and improve at will 'forever' as your solution grows
  • Disadvantages
    • It will take a lot of time.
    • Each country you operate in will need looking at separately - there are no high-quality 'global' rules that you can apply (though there are of course snippets that can be reused)

Third Party

  • Advantages
    • If de-duplication is not your specialty - let the experts do it
    • Ready to go and deliver value immediately
  • Disadvantages
    • Cost

Whether you go your own route or third party, I suggest you start by creating a clear goal.

What are your inputs:

  • How 'clean' is your data?
  • How standardised is your data?
  • How do the records link together?
  • Are the address records from just one country, or from several?

What are your workflows:

  • How often do you need to run this process?
  • Do you want to stop duplicates entering your system in the first place or just run periodic bulk runs?

What do you want from the project?

  • To what level (document, person, household, organisation - see below) do you want to identify duplicates?
  • What do you want to do with those duplicates?
    • Delete duplicates and keep one record
    • Merge duplicates to create one master record
    • This stage is sometimes referred to as creating the 'Golden' record: deciding which information to keep and which information to disregard (see the sketch after this list).
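
To make that survivorship idea concrete, here is a T-SQL sketch assuming an earlier matching step has already stamped each row with a cluster identifier and that 'most recently updated wins' is your rule; ClusterKey, UpdatedAt and dbo.Persons are illustrative names:

```sql
-- Survivorship sketch: within each duplicate cluster, keep the
-- most recently updated row as the golden record.
WITH Ranked AS (
    SELECT p.*,
           ROW_NUMBER() OVER (
               PARTITION BY p.ClusterKey
               ORDER BY p.UpdatedAt DESC) AS rn
    FROM dbo.Persons AS p
)
SELECT *
FROM Ranked
WHERE rn = 1;   -- one surviving record per cluster
```

In practice the golden record is often assembled field by field (best address from one row, best phone from another) rather than picked wholesale.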

To go into a bit more detail about some of those choices, consider the following dummy addresses:

Are you trying to dedupe to household level:

  • Ann Smith, 1 main st, DupeVille, MA, 12345
  • Bob Smith, 1 main street, DupeVille, MA, 12345

become

  • Ann and Bob Smith, 1 Main St, DupeVille, MA, 12345-6789

Person Level

  • Robert Smith, 1 main st, DupeVille, MA, 12345
  • Bob Smith, 1 main street, DupeVille, MA, 12345

become

  • Robert Smith, 1 Main St, DupeVille, MA, 12345-6789

or even by the IDs in your document database.
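
A simple way to see the household-level case in T-SQL: once addresses have been standardised (see the earlier sketches), records sharing an address key form one household cluster. Table and column names here are illustrative:

```sql
-- Household-level candidate clusters: rows that share the same
-- standardised address line and ZIP code.
SELECT LOWER(AddressLine1) + '|' + ZipCode AS HouseholdKey,
       COUNT(*)                            AS RecordsInHousehold
FROM dbo.Persons
GROUP BY LOWER(AddressLine1) + '|' + ZipCode
HAVING COUNT(*) > 1;   -- only households with potential duplicates
```

Person-level matching then needs name comparison on top of this, including nickname handling ('Robert' vs 'Bob'), which exact grouping alone cannot do.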

Once you have that plan, it may help you make up your mind about the best route to take. If you want to create it yourself, the links you have found certainly put you in the right mindset. If you want to go third party - there are a good range of suppliers out there. Just make sure you choose someone you can trust - they're going to be changing your data!

Google around for the various suppliers - Experian Data Quality are one of them (my company!) and depending upon where in the world you are, you can find your best contact details and more info here: http://www.qas.com/contact/office-locations.htm . We have tools that integrate with SQL Server 2008 R2 which can score differing input types and then either automatically dedupe these for you or return the clusters of potential duplicates for you to review yourself.

Take your plan, and clear idea of what you need from them and discuss it with them. Whoever you choose will be able to talk you through your plan, discuss your goals and tell you if they are the right people for the job.

Think I went on a bit there :-) but hopefully that points you in the right direction - Good luck!

Al Mills
  • Nice write-up. Doing this yourself is very difficult. You have to consider all sorts of things, as you point out: not only fuzzy string comparison, but also nicknames, address quality, etc. We went with a third party [API](http://matchbox.io) which allowed us to retain our own data in the database, but still dedupe in real time. – Joel Oct 16 '15 at 13:34

If you do fuzzy grouping with multiple columns, you will get a _Similarity_<column name> output column for every column you choose as input. With this similarity information you can calculate your own weighted scores, thresholds, etc.
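
For example, if you land the Fuzzy Grouping output in a staging table, the weighted score asked about in the question becomes a simple expression. The table name, the exact similarity column names, and the weights below are assumptions to adapt to your own package:

```sql
-- Sketch: combine the per-column similarity outputs of SSIS Fuzzy
-- Grouping into one weighted score (weights chosen here to sum to 1).
SELECT _key_in,
       _key_out,
       0.5 * _Similarity_Name
     + 0.3 * _Similarity_SerialNo
     + 0.2 * _Similarity_Address AS WeightedScore
FROM dbo.FuzzyGroupingOutput
WHERE 0.5 * _Similarity_Name
    + 0.3 * _Similarity_SerialNo
    + 0.2 * _Similarity_Address >= 0.8;   -- your own threshold
```

The same expression could also live in an SSIS Derived Column transformation so the scoring happens inside the data flow.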

Narti