I am trying to match two separate columns in two entirely different datasets, and I am looking for the best way to achieve this. It need not be a complete match, but also a partial match. I tried COMPGED method, but it took a long time, partly because there's no proper join condition I suppose.
Below is the sample data from two datasets. This is not the actual data, but the structure is similar
Dataset 1:
Record Count: Approx 1894500
ID name
124 senatorltd
122 pipelineinc
1034 bellcorp
1934 sachetinc
5463 altd
1928 jokerinc
Dataset 2:
Record Count: Approx 1637500
COMPANYNAME
altd
thousandislandsinc
picturesinc
nellynellycorp
cellbellassoc
neyyorkinc
Goal is to try to match the names in the two datasets. I tried below approach hoping that I could get a distribution and then filter the ones with compged_vlue=0, but it takes too much time due to the cross join
proc sql;
CREATE table mylibrary.companynamematching as (
select
a.id,
a.name,
b.companyname,
compged(a.name,b.companyname) as compged_vlue
from mylibrary.dataset1 as a cross join mylibrary.dataset2 as b
where compged(a.name,b.companyname)<=140
);
quit;
I'm open to suggestions on how to solve this the most efficient way. Appreciate any assistance, Thank you!