
I am trying to match two columns from two entirely different datasets, and I am looking for the best way to achieve this. The match need not be exact; partial matches should count as well. I tried the COMPGED method, but it took a long time, partly, I suppose, because there is no proper join condition.

Below is sample data from the two datasets. It is not the actual data, but the structure is similar.

Dataset 1:
Record count: approx. 1,894,500

ID     name
124    senatorltd
122    pipelineinc
1034   bellcorp
1934   sachetinc
5463   altd
1928   jokerinc



Dataset 2:
Record count: approx. 1,637,500

COMPANYNAME
altd
thousandislandsinc
picturesinc
nellynellycorp
cellbellassoc
neyyorkinc

The goal is to match the names across the two datasets. I tried the approach below, hoping to get a distribution and then filter the rows with compged_value = 0, but it takes too much time due to the cross join:

proc sql;
  create table mylibrary.companynamematching as
    select
      a.id,
      a.name,
      b.companyname,
      compged(a.name, b.companyname) as compged_value
    from mylibrary.dataset1 as a
    cross join mylibrary.dataset2 as b
    /* CALCULATED reuses the value computed in the SELECT list,
       so COMPGED is not evaluated twice per pair */
    where calculated compged_value <= 140;
quit;
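One common way to cut the cross join down, suggested in the comments below, is to block on a shared prefix: only score pairs whose names start with the same first few characters. A minimal sketch, assuming the table and column names above and an 8-character prefix as the blocking key (the prefix length is an assumption; shorter prefixes catch more matches but produce more pairs to score):

```sas
proc sql;
  create table mylibrary.companynamematching as
    select
      a.id,
      a.name,
      b.companyname,
      compged(a.name, b.companyname) as compged_value
    from mylibrary.dataset1 as a
    inner join mylibrary.dataset2 as b
      /* blocking key: compare only names sharing the same 8-char prefix */
      on substr(a.name, 1, 8) = substr(b.companyname, 1, 8)
    where calculated compged_value <= 140;
quit;
```

The trade-off is that any pair differing in its first eight characters (e.g. 'bellcorp' vs. 'mybellcorp') is never scored, so this is a sketch of the blocking idea rather than a complete solution.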

I'm open to suggestions on the most efficient way to solve this. I appreciate any assistance. Thank you!

Chug
  • I can't see any way to do this without having to process the product of the cross (Cartesian) join. You could do a much more efficient search for initial substring matches - i.e. for cases where the first part of NAME and COMPANYNAME are similar - by sorting and just comparing names that start with the same substring. But that would mean that 'bellcorp' would not match 'mybellcorp', for example. You'd have to decide if that's good enough. Maybe you can give some examples of non-identical names that you would want to count as matches in your results. – Chris Long Oct 30 '19 at 09:50
  • @ChrisLong Thank you! I compared the first 8 letters of each string to do the comparison. The code runs faster definitely. But I'm wondering if this is the best way to do it, and if there's not a better way? I'm open to ideas and suggestions – Chug Oct 31 '19 at 17:59
  • I was thinking a bit more about this... I should note that I don't have any particular experience with this type of text processing and there's probably a lot of academic research available, which I haven't read. One line of attack might be to process both your data sets to generate new data sets containing every substring of, say, five characters from each name. So 'senatorltd' would generate 'senat', 'enato', 'nator', etc. Then you could sort those and do a direct merge to find names with similar substrings. The two data sets would be large, but much smaller than the cartesian product. – Chris Long Nov 01 '19 at 11:08
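The substring idea from the last comment can be sketched as a pair of DATA steps that explode each name into its 5-character "grams", followed by a join on shared grams to build a candidate list, with COMPGED run only on those candidates. This is a rough sketch under the assumptions of the sample data above; the gram length of 5 comes from the comment, and names shorter than five characters produce no grams and would need separate handling:

```sas
/* generate every 5-character substring ("gram") of each name */
data grams1;
  set mylibrary.dataset1;
  length gram $5;
  do start = 1 to length(name) - 4;
    gram = substr(name, start, 5);
    output;
  end;
  keep id name gram;
run;

data grams2;
  set mylibrary.dataset2;
  length gram $5;
  do start = 1 to length(companyname) - 4;
    gram = substr(companyname, start, 5);
    output;
  end;
  keep companyname gram;
run;

proc sql;
  /* candidate pairs: names sharing at least one gram */
  create table candidates as
    select distinct a.id, a.name, b.companyname
    from grams1 as a
    inner join grams2 as b
      on a.gram = b.gram;

  /* score only the candidates, far fewer than the full cross join */
  create table scored as
    select c.*, compged(c.name, c.companyname) as compged_value
    from candidates as c
    where calculated compged_value <= 140;
quit;
```

The gram tables are large (several grams per name), but the candidate set is typically much smaller than the full Cartesian product, which is where the savings come from.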

0 Answers