1

Say I have two tables, one called InternalCustomers which contains all my customers, and a table called ExternalCustomers which contains customer information from an external source.

In my table I have a Legal_Id for the customer which is an official id for the company, which also exists in the external table. However my table contains a mix of a lot of customers from various sources (and data quality is not perfect) and the Legal_Id is therefore not necessarily unique.

Further my Customer_Name is not the legally registered name for the Company, but has just been typed in by an account manager, so it may read something different than the official name (e.g. may read Google Incorporated rather than Google Inc.)

My challenge is to join these two tables together with some sort of Pattern-matching, but I don't know if this is even possible?

InternalCustomers (17 mio different customers):

| Legal_Id | Customer_Name          |
|----------|------------------------|
| 1234     | Chr.Hansen Group       |
| 1235     | Apple Inc              |
| 1236     | A.P. Moller Maersk A/S |

ExternalCustomers (no control):

| Legal_Id | Customer_Name          |
|----------|------------------------|
| 1234     | Christian Hansen AS    |
| 1235     | Nokia                  |
| 1236     | Mærsk                  |

Expected result when doing a left join:

| Legal_Id | Customer_Name          | Legal_Id (External) | Customer_Name (External) |
|----------|------------------------|---------------------|--------------------------|
| 1234     | Chr.Hansen Group       | 1234                | Christian Hansen AS      |
| 1235     | Apple Inc              | NULL                | NULL                     |
| 1236     | A.P. Moller Maersk A/S | 1236                | Mærsk                    |

So to summarize: I want to do a left join between the two tables on Legal_Id and a pattern-match on Customer_Name if possible.

Apple <> Nokia, therefore this join should not be completed - but A.P. Moller Maersk A/S = Mærsk and this join should be completed.

Customer names are not always easily identifiable so that I could just match a partial name - e.g. A.P Moller Maersk A/S would be fairly easy for a human to match up to Mærsk, but a machine will say that Maersk <> Mærsk.

Is this in any way achievable?

ssn
  • 509
  • 3
  • 12
  • 1
    It's possible but you will have to determine what would be considered a match. Regardless, you are likely going to miss some and get some false positives. Best to clean it up first and then analyze your data, even if this is cumbersome and time consuming. – S3S Aug 31 '17 at 14:13
  • I realize that I will probably get some false positives, which is okay for our purpose, as long as it fits within a reasonable amount. Cleaning up data is unfortunately not really an option because I have 17 mio. customers with no "check-list", so it would be a manual process - that is why I wanted to do the "pattern match" somehow. – ssn Aug 31 '17 at 14:22
  • You have to define *reasonable* and explain what you would consider a match. For example, changing all `Incoporated` tin `Inc.` or visa versa is easy. However, something like `A.P. Moller Maersk A/S | 1236 | Mærsk ` is a lot harder. You'd have to basically accept that any word in Customer_Name matches any work in Custmer_name (External) is a match. – S3S Aug 31 '17 at 14:25
  • *Reasonable* may be fluffy, if I get 20-30% false positives may be too much, but 10-20% may be okay - depending on how many of my customers get a match at all. I guess accepting one word would bring me a long way, because I would find it unlikely that the same `Legal_Id` and some part of the name would match coincidentally. However in this case the `A.P. Moller Maersk A/S | 1236 | Mærsk` would not get at match. – ssn Aug 31 '17 at 14:29
  • How often do you need to do this? Can a job that runs once a day to populate a matching table work for you? Or do you need real time info? – HLGEM Aug 31 '17 at 14:34
  • It will pretty much be a one time thing, I will then store the information in my database. – ssn Aug 31 '17 at 14:42

1 Answers1

0

I wouldn't call this a robust solution (and certainly not elegant), but it does handle the provided examples. note: SplitCSVToTable8K is renamed version of the popular DelimitedSplit8K function.

IF OBJECT_ID('tempdb..#TableA', 'U') IS NOT NULL 
DROP TABLE #TableA;
GO 
CREATE TABLE #TableA (
    Legal_Id INT NOT NULL,
    Customer_Name VARCHAR(30) NOT NULL 
    );
INSERT #TableA (Legal_Id, Customer_Name) VALUES
    (1234, N'Chr.Hansen Group'),
    (1235, N'Apple Inc'),
    (1236, N'A.P. Moller Maersk A/S');
GO 
IF OBJECT_ID('tempdb..#TableB', 'U') IS NOT NULL 
DROP TABLE #TableB;
GO 
CREATE TABLE #TableB (
    Legal_Id INT NOT NULL,
    Customer_Name VARCHAR(30) NOT NULL 
    );
INSERT #TableB (Legal_Id, Customer_Name) VALUES
    (1234, N'Christian Hansen AS'),
    (1235, N'Nokia'),
    (1236, N'Mærsk');
GO

SELECT 
    ta.Legal_Id,
    ta.Customer_Name,
    [Legal_Id (Ext)] = tb.Legal_Id,
    [Customer_Name (Ext)] = tb.Customer_Name,
    sca.Item,
    scb.Item
FROM 
    #TableA ta
    JOIN #TableB tb
        ON ta.Legal_Id = tb.Legal_Id
    CROSS APPLY dbo.SplitCSVToTable8K(REPLACE(ta.Customer_Name, '.', ' '), ' ') sca
    CROSS APPLY dbo.SplitCSVToTable8K(REPLACE(tb.Customer_Name, '.', ' '), ' ') scb
WHERE 
    '%' + CAST(sca.Item AS NVARCHAR(4000)) + '%' LIKE '%' + CAST(scb.Item AS NVARCHAR(4000)) + '%';

enter image description here

Jason A. Long
  • 4,382
  • 1
  • 12
  • 17