1

I have a list with some fuzzy variables and a list with some clean variables. I want to want to (fuzzy) match both of them where the compged distance is lowest and some rules regarding their edit distances are satisfied (f.ex. compged < 100 and spedis < 50). I tried the following code

proc sql noprint;
create table out as
select min(compged(fuzzy.fuzzy_title,clean.cleaned_title,100)) 
as comp,
fuzzy.fuzzy_title, clean.cleaned_title
from fuzzy inner join clean
on (compged(fuzzy.fuzzy_title,clean.cleaned_title,100) < 100 and 
spedis(clean.cleaned_title,fuzzy.fuzzy_title) < 50);
quit;

The datasets fuzzy and clean basically just contain the titles that I want to match. The code I use just gives me the minimum compged score of the whole dataset and then some arbitrary match where my condition regarding the distances are satisfied. Is there a way to choose exactly the clean_title with the minimum compged score for a given fuzzy_title? I might have searched wrong, but I couldn't find the answer to this.

banan
  • 71
  • 6
  • 2
    It would help if you could provide some example data for the datasets fuzzy and clean, along with what you expect the output to be – Longfish Nov 14 '17 at 10:53

1 Answers1

2

I think you are looking for group by + having:

   proc sql;
    create table out as
    select 
         compged(fuzzy.fuzzy_title,clean.cleaned_title,100)as comp
        ,fuzzy.fuzzy_title
        ,clean.cleaned_title
    from fuzzy inner join clean
        on (compged(fuzzy.fuzzy_title,clean.cleaned_title,100) < 100 
        and spedis(clean.cleaned_title,fuzzy.fuzzy_title) < 50)
    group by fuzzy.fuzzy_title
    having calculated comp = min(compged(fuzzy.fuzzy_title,clean.cleaned_title,100))
    ;quit;

If there are more fuzzy_title + cleaned_title pairs having the same comp value, all of them will be in the output. You can select only one of them in a single query. However, I think it's easier to keep those steps separated and select one row for each fuzzy_title in another query (e.g. using first data step variable).

Petr
  • 376
  • 1
  • 6