0

In the sample data, I've listed the names of employers of a particular person(a prospective customer) which we received from 2 different sources. I've been trying to find a way to better match the two names and get good results. (Currently, it's being done as a manual job) I don't think I'm trying to do the impossible...but if it's not achievable, please don't be harsh!

The below is the dataset which is a "match" as per manual verification.

        ADDUS==============================================Addus Home Care
        Amazon.com, Inc. and its affiliates=====================Amazon.com
        Aon========================================Aon Service Corporation
        ARAMARK Food & Support Svc.================================Aramark
        AT&T Mobility Services LLC===========================AT&T Mobility
        CDW, LLC===========================================CDW Corporation
        Lurie Children's Hospital of Chicago======Lurie Childrens Hospital
        Securitas Security Services USA, Inc============Securitas security
        The PNC Financial Services Group, Inc.======================PNC NA
        United States Department of Homeland Security====US Homeland Securiti
        TCS=========================================Tata Consultancy Services

Although almost obvious, let me state them for the sake of emphasis.

  1. There might be spelling mistakes in names from either of these sources
  2. There might be abbreviations(Ex: TCS in one place and Tata Consultancy in another)

Please suggest me an algorithm or a way to do this with least number of "wrong acceptance cases" - by which I meant cases like this, which have gotten high match ratios from different algorithms.

enter image description here

Please try to suggest a way of doing this.

Aksen P
  • 4,564
  • 3
  • 14
  • 27
Yashu046
  • 51
  • 6
  • Well, I thought I can expect an idea or two from stackoverflow. But I guess no! If there are any inputs anyone can provide, please do so. – Yashu046 Jul 21 '19 at 02:32

1 Answers1

0

I see only one, but over the time pretty progressive and accurate option: (1) first the caveat: you have your 'manual job' and you will stick with it. (2) but now the better part: the manual job is getting shorter and shorter the more data you have had classified over the time - kind of self learning machine. See the following attempt description, if you are interested, we can discus the details at later time.

1. Yur current workflow
    1. create a initial employer list of triplets.
        1. employer1 (string)
        2. employer2 (string)
        3. equivalence (values {VALID|INVALID}), default: INVALID

       Result: AllEpmployersList, unverified.

    2. Process the AllEpmployersList manually
        1. for each AllEpmployersList member (triplet) 
            1. set the value for equivalence element
               VALID or INVALID respectively.

       Result: VerifiedEpmployersList, triplets  with verified equivalence value.

    3. Use the VerifiedEpmployersList as required for downstream processing.

2. The Adapted (advanced) new workflow
    1. create a initial employer list of triplets.
        1. employer1 (string)
        2. employer2 (string)
        3. equivalence (values {VALID|INVALID}), default: INVALID

       Result: AllEpmployersList, unverified.

    2. feed unverified AllEpmployersList into matchKnownEmployers process (described later).

       Result: two lists, AllKnownEmployers and AllUnknownEployers.

    3. Process the AllUnknownEployers list manually.
       Result: VerifiedEpmployersList with verified equivalence value.

    4. feed the VerifiedEpmployersList list into importKnownEmployers process

    5. feed (again) the AllEpmployerList (Result 2.1) into matchKnownEmployers process

       Result:two lists, AllKnownEmployers and AllUnknownEployers.

    6. Use the AllKnownEmployers as required for downstream processes.


3. Required Investments (instances you have to establish)
    1. create KnownEmployers database
        1. create table knownEmployerNames,
            1. columns:
                1. id
                2. employerName
                3. aliasIdValue
        2. create table lastAliasIdValue
            1. columns:
                1. aliasIdValue
        3. init table lastAliasIdValue
            1. insert one initial row, aliasIdValue = 0

    2. create matchKnownEmployersProcess with this characteristics:
        1. Input data:  employerList (triplets)
        2. init empty list for knownEmployers and unknownEployers
        3. for each member in employerList do:
            1. if employer1 and  employer2 in table knownEmployerNames and employer1::aliasIdValue equals employer2::aliasIdValue
                1. then set member::equivalence value to VALID and append the member into knownEmployers list
                2. else  append the member  into  unknownEployers list
        4. Output data: two lists, knownEmployers and unknownEployers.

    3. create importKnownEmployersProcess with this characteristics:
        1. Input data: employerList (triplets)
        2. for each element in employerList do:
            1. if equivalence element value is VALID
                1. insert new pattern
                    1. if employer1 or employer2 is in table knownEmployerNames
                        1. then 
                            1. function isUnknown(employer1, employer2) {
            retVal = {}
            retVal[‘aliasIdValue’] = 
                employer1::aliasIdValue ||
                employer2::aliasIdValue
            retVal[‘newEmployer’]  = 
                 (!employer1 || !employer2)
            return retVal
                           }
                            2. aliasIdValue, newEmployer = isUnknown(employer1,  employer2)
                            3. insert aliasIdValue, newEmployer into knownEmployerNames table
                        2. else 
                            1. fetch and increment aliasIdValue from lastAliasIdValue table
                            2. insert into knownEmployerNames (employer1, aliasIdValue) and (employer2, aliasIdValue)
                            3. update incremented lastAliasIdValue in the  lastAliasIdValue table
        3. Output data: none
harry hartmann
  • 351
  • 1
  • 9