0

I have two tables from two different databases, and both contain lastName and firstName columns. I need to create JOINa relationship between the two. The lastName columns match about 80% of the time, while the firstName columns match only about 20% of the time. And each table has totally different personID primary keys.

Generally speaking, what would be some "best practices" and/or tips to use when I add a foreign key to one of the tables? Since I have about 4,000 distinct persons, any labor-saving tips would be greatly appreciated.

Sample mismatched data:

db1.table1_____________________  db2.table2_____________________
23    Williams       Fritz       98   Williams       Frederick
25    Wilson-Smith   James       12   Smith          James Wilson
26    Winston        Trudy       73   Winston        Gertrude

Keep in mind: sometimes they match exactly, often they don't, and sometimes two different people will have the same first/last name.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
KiloVoltaire
  • 265
  • 3
  • 10
  • there is no direct relation at all between that sample data - should that result in a 'join' or not? – pala_ May 25 '15 at 04:39
  • This showed worst-case examples of mismatched data. But your excellent answer below accurately reflects how to get past the solid matches first, and then deal with mismatched as well as duplicate data. And yes, there are other fields which will help, such as what US state the person is from. I'll also try a little substring and regex functions as well; such as lastName plus firstInitial. Thanks again! – KiloVoltaire May 25 '15 at 05:16
  • @KiloVoltaire I would like to suggest one bit different possiblility to get partially matching rows - if you have multiple columns (you say State), then you can do join on "number of matching columnsL - say you have only 3 - first/last name and state, and you can write is so match in any two of them will do, so same first and state, same last and state etc.. more columns you have the better results you get, only you need to get the right threshold (4 of 7 will return much more than 6 of 7). And you could compute the score on string difference too (Levenshtein distance maybe). – jkavalik May 25 '15 at 06:02

1 Answers1

1

You can join on multiple fields.

select * 
  from table1
    inner join table2
      on table1.firstName = table2.firstName
        and table1.lastName = table2.lastName

From this you can determine how many 'duplicate' firstname / last name combos there are.

select table1.firstName, table2.lastName, count(*)
  from table1
    inner join table2
      on table1.firstName = table2.firstName
        and table1.lastName = table2.lastName
  group by table1.firstName, table2.lastName
  having count(*) > 1

Conversely, you can also determine the ones which match identically, and only once:

select table1.firstName, table2.lastName
  from table1
    inner join table2
      on table1.firstName = table2.firstName
        and table1.lastName = table2.lastName
  group by table1.firstName, table2.lastName
  having count(*) = 1

And this last query could be the basis for performing the bulk of your foreign key updates.

For those names that match more than once between the tables, they'll likely need some sort of manual intervention, unless there are other fields in the table that can be used to differentiate them?

pala_
  • 8,901
  • 1
  • 15
  • 32