I tried implementing a solution in my SAS code, but with no luck. I'm trying to add a Jaccard distance column to my dataset. I keep getting these errors: "variable name & is not valid" and "invalid value for the KEEP option". The idea is to solve a matching problem between two datasets while taking typing errors into account.

data table_test;
    input nom1 $3. nom2 $3.;
cards;
abcade
vdenfr
azfefs
;
run;

%macro kshingling
(string
,k=5
,out=&sysmacroname.
)
;

data &out.;
   string = strip(prxchange('s#\s# #',-1,symget('string')));
   do _n_ = 1 to lengthn(string)-&k.+1;
      ngram = substr(string,_n_,&k.);
      output;
   end;
run;

%mend;



%macro jaccard
(string1
,string2
)
;

%kshingling(&string1.,k=2,out=s1)
%kshingling(&string2.,k=2,out=s2)

proc append base=s1 data=s2; run;

proc freq data=s1 noprint;
   tables string*ngram / out=s2;
run;

proc transpose data=s2 out=s1(drop=_name_ _label_); 
by string notsorted;
var count;
id ngram;
run;

proc stdize data=s1 out=s2 missing=0 reponly;
var _numeric_;
run;

proc distance data=s2 method=jaccard absent=0 out=s1; 
var anominal(_numeric_);
id string;
run;

data t(keep=&string1.);
set s1(firstobs=2);
run;

data _null_;
set t;
call symput('Jaccard',&string1.);
%put Distance de Jaccard = &Jaccard;
run;

%mend;

data test;
set table_test;
call symput('n1',nom1);
call symput('n2',nom2);
%jaccard(&n1,&n2);
run;

data Jacc;

Dist_Jacc=&Jaccard;
run;

data Final; merge table_test Jacc; run;




sarah99
  • If you want to store the results of that %jaccard() macro into data then write the results into a dataset instead of macro variable. If you do want to store the result into a macro variable then you probably need to make it a GLOBAL macro variable if you want to use the result after the macro has finished. The use of the value of STRING1 parameter as the NAME of a variable is going to limit the usefulness of the macro since then it cannot work on any string that is not a valid SAS variable name. – Tom Oct 25 '22 at 13:59
  • If you want to generate data then generate data. Look into using PROC APPEND to aggregate the results from multiple calls to the macro into a single dataset. – Tom Oct 25 '22 at 15:18

2 Answers

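You are mixing DATA step code and macro code in ways that cannot work.

CALL SYMPUT executes at run time, whereas the direct macro call %jaccard is resolved while the DATA step is still being compiled, which happens before run time. When the macro is expanded, &n1 and &n2 therefore do not yet hold the values from any row. In addition, the macro generates its own DATA and PROC steps, and the first of those step boundaries ends the enclosing DATA step early.

For instance, this step does not do what it appears to do: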

data test;
set table_test;
call symput('n1',nom1);
call symput('n2',nom2);
%jaccard(&n1,&n2);
run;
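The timing issue can be seen in isolation with a small sketch. The macro variable name `demo` is made up for illustration and is not part of the original code. The `%put` inside the step runs while the step is being compiled, so it still sees the old value; the value assigned by CALL SYMPUTX is visible only once the step has run.

```sas
%let demo = before;

data _null_;
  /* Assigned at run time, after the whole step has been compiled */
  call symputx('demo', 'after');
  /* Macro statement: executed during compilation, so it sees "before" */
  %put NOTE: while compiling, demo=&demo;
run;

/* By now the step has executed, so this one sees "after" */
%put NOTE: after the step, demo=&demo;
```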

To run %jaccard once for each record in table_test, use a DATA step that builds the macro call as source code and then tells the session to execute it:

data _null_;
  set table_test;
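  /* nom1 and nom2 are the variables in table_test; %nrstr delays the macro until after this step */
  macro_call = '%nrstr(%jaccard)' || cats('(' , nom1, ',', nom2, ')');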
  call execute (macro_call);
run;
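To check what CALL EXECUTE generates before plugging in the real macro, a stand-in macro that only writes its arguments to the log can help. This is a sketch: `%show_pair` is a hypothetical name, and it assumes the `table_test` dataset from the question already exists.

```sas
/* Hypothetical stand-in for %jaccard: it only logs its two arguments */
%macro show_pair(a, b);
  %put NOTE: pair is &a and &b;
%mend show_pair;

data _null_;
  set table_test;
  /* Single quotes keep the macro call as plain text while this step compiles.
     %nrstr then defers the macro's execution until the step has finished. */
  call execute(cats('%nrstr(%show_pair)(', nom1, ',', nom2, ')'));
run;
```

The queued calls run one after another once the DATA step ends, which is what lets a macro that contains its own DATA and PROC steps run once per row.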
Richard

Looks to me like the OUTPUT of your macro is the dataset T. You can use PROC APPEND to aggregate the results of multiple macro calls into a single dataset. You can then combine that data with your input dataset of ngrams.

data _null_;
  set table_test;
  call execute(cats('%nrstr(%jaccard)(',nom1,',',nom2,');'));
  call execute('proc append base=result data=t; run;');
run;

data want;
   set table_test;
   set result;
run;

BUT you will need to make sure the generated T dataset has THE EXACT SAME STRUCTURE each time.

So change the ending steps of the macro to this single step so that the dataset T always consists of ONE observation and ONE variable and the variable is named Jaccard. You can also use the %GLOBAL statement to make sure that the value of JACCARD macro variable is available after the macro finishes.

%if not %symexist(jaccard) %then %global jaccard;
data t ;
  set s1(keep=&string1. rename=(&string1.=Jaccard) obs=2 firstobs=2);
  call symputx('Jaccard',Jaccard);
run;
%put Distance de Jaccard = &Jaccard;
Tom
  • Thank you for the response. I'm getting closer to the solution, but putting the variable names in the parameters does not work: I get an empty column with no cells. – sarah99 Oct 26 '22 at 09:57
  • I have no idea what putting variable names in parameters means to you. The macro you showed is designed to get the actual strings passed into it, not names of anything. It looks like the strings end up being used as variable names because of the ID statement in the PROC DISTANCE call. You should probably add another variable to the input dataset for PROC DISTANCE that does have a value that will be a valid variable name. Then you can select that variable from the output. – Tom Oct 26 '22 at 13:00
  • You could probably change your algorithm to start with the dataset and then just carry over the ID values for the original observations as BY variables into all of the steps so you generated all of the scores in one call to PROC DISTANCE. – Tom Oct 26 '22 at 16:03
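
For comparison, the distance can also be computed row by row in a single DATA step, which avoids using string values as variable names altogether. This is a different technique from the PROC DISTANCE pipeline discussed above, not a rewrite of it, and its results are not guaranteed to match that procedure in every edge case. It is a minimal sketch that assumes 2-character shingles, values without embedded blanks, and the `table_test` dataset from the question. The helper variables `n1`, `n2`, `common`, and `union_size` are made up for illustration; `Dist_Jacc` reuses the name from the question.

```sas
data want;
  set table_test;
  length ngram $2;
  n1 = 0; n2 = 0; common = 0;

  /* Distinct bigrams of nom1, and how many of them also occur in nom2 */
  do i = 1 to lengthn(nom1) - 1;
    ngram = substr(nom1, i, 2);
    if find(nom1, ngram) = i then do;   /* first occurrence only */
      n1 = n1 + 1;
      if find(nom2, ngram) > 0 then common = common + 1;
    end;
  end;

  /* Distinct bigrams of nom2 */
  do i = 1 to lengthn(nom2) - 1;
    ngram = substr(nom2, i, 2);
    if find(nom2, ngram) = i then n2 = n2 + 1;
  end;

  /* Jaccard distance = 1 - |A intersect B| / |A union B| */
  union_size = n1 + n2 - common;
  if union_size > 0 then Dist_Jacc = 1 - common / union_size;

  drop i ngram n1 n2 common union_size;
run;
```

Rows where neither string has at least two characters are left with a missing `Dist_Jacc`, since the union of shingles is empty.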