
I have two files which I would like to match by name, and I would like to allow for spelling errors by using the COMPGED function. The names have been thoroughly cleaned and I have no other useful match variables that could be used to reduce the search space.

The files name1 and name2 have over 500k rows each, and after 11 hours this code has still not finished running.

Is there some way I can code this more efficiently or is my issue purely due to computing power?

proc sql;
  create table name1_name2_Fuzzy as
    select a.*, b.*
      from name1 as a 
       inner join name2 as b
               on COMPGED(a.match_name, b.match_name) < 200;
quit;
Dominic Comtois
James
  • Can you do a first run to find exact matches, remove those, and then try the fuzzy match? (See the sketch after these comments.) – Reeza Jun 02 '16 at 12:08
  • Does it have to be in SQL with COMPGED? I have found success in using COMPLEV (in a data step) to calculate the odds of two strings that "match" (0 usually being an exact match, and a match being less likely the further the value is from zero). I did find some statements on why COMPLEV may be a better solution than COMPGED, but I am not versed enough in the two to post a true answer... http://compgroups.net/comp.soft-sys.sas/complev-vs-compged/1567642 – SMW Jun 02 '16 at 12:08
  • You probably want to use "select distinct" instead of "select" (unless you want duplicate observations from this join), but that probably won't speed things up. I agree with Reeza that you should remove the perfect matches before worrying about the imperfect ones. – Sean Jun 02 '16 at 13:20
  • Superfluous: I will try your suggestion, thanks! SMW: You're correct that COMPLEV is faster! I tried this on a subset of the data and it reduces computation time, but I'm still looking at a very long execution time. Reeza: Thank you, unfortunately the number of exact matches is very low, so it doesn't reduce the problem by much. – James Jun 05 '16 at 13:19
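A minimal sketch of the two-pass idea raised in the comments, assuming the tables and column from the question (the output dataset names exact_matches and name1_rest are made up for illustration): take the exact matches first, then run the expensive fuzzy join only on the rows that remain.

proc sql;
  /* pass 1: exact matches on the cleaned name */
  create table exact_matches as
    select a.*, b.*
      from name1 as a
       inner join name2 as b
               on a.match_name = b.match_name;

  /* rows from name1 with no exact match, to be fuzzy-matched in pass 2 */
  create table name1_rest as
    select *
      from name1
     where match_name not in (select match_name from name2);
quit;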

2 Answers


The COMPGED function has an optional argument that you didn't use, and it can improve performance (maybe 6 or 7 hours instead of 11).

That argument is the cutoff. If you choose 300 as the cutoff, then as soon as the accumulated distance between the words reaches 300, SAS stops the calculation and returns 300.

In your case, make sure the cutoff is no smaller than the 200 threshold in your join condition, so that no distance below 200 is ever truncated.
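For example, here is the question's query with the cutoff passed as COMPGED's third argument (a sketch; table and column names are taken from the question):

proc sql;
  create table name1_name2_Fuzzy as
    select a.*, b.*
      from name1 as a
       inner join name2 as b
               /* third argument = cutoff: stop accumulating cost at 200 */
               on COMPGED(a.match_name, b.match_name, 200) < 200;
quit;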

The COMPLEV function is faster than COMPGED. If you don't need the exact cost of each operation (tuned with the CALL COMPCOST routine), you can use it instead of COMPGED and save minutes or maybe hours of computation. COMPLEV also has the cutoff option.
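If COMPLEV is acceptable, the join condition can be swapped directly. A sketch follows; note that COMPLEV counts edit operations rather than weighted costs, so the 200 threshold from COMPGED does not carry over, and the threshold of 2 below is only illustrative.

proc sql;
  create table name1_name2_Fuzzy as
    select a.*, b.*
      from name1 as a
       inner join name2 as b
               /* cutoff of 3 with an illustrative threshold of <= 2 edits */
               on COMPLEV(a.match_name, b.match_name, 3) <= 2;
quit;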

Hope this helps!

Majdi

Working from memory here, but if the first character of each match_name is different, the COMPGED distance will be over 200, true? So you wouldn't consider them a match?

If so, make an indexed column with the first character of match_name in each table, and join on that before applying COMPGED. That should eliminate most of the non-matching pairs, so far fewer COMPGED calculations will be needed.
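A sketch of that blocking idea, assuming the tables from the question (first_letter and the *_blk dataset names are invented for illustration); the equality key on the first character lets the join restrict COMPGED to same-initial pairs instead of every pair:

data name1_blk;
  set name1;
  first_letter = substr(match_name, 1, 1);  /* blocking key */
run;

data name2_blk;
  set name2;
  first_letter = substr(match_name, 1, 1);
run;

proc sql;
  create table name1_name2_Fuzzy as
    select a.*, b.*
      from name1_blk as a
       inner join name2_blk as b
               on a.first_letter = b.first_letter
              and COMPGED(a.match_name, b.match_name, 200) < 200;
quit;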

Turophile
  • Yes, the cost of the COMPGED function to replace the first letter in a string is 200! I tried your suggestion; however, I end up with a similar issue: although the size of the new table is definitely smaller than the Cartesian product in the original code, it is still too large. – James Jun 02 '16 at 10:02