I created a SAS function using PROC FCMP to calculate the Jaccard distance between two strings. I do not want to use macros, as I'm going to apply it to a large dataset for multiple variables. The problem is that the list of substrings I build is missing some of the substrings.

proc fcmp outlib=work.functions.func;
function distance_jaccard(string1 $, string2 $);
n = length(string1);
m = length(string2);
ngrams1 = "";


    do i = 1 to (n-1);
    ngrams1 = cats(ngrams1, substr(string1, i, 2) || '*');
    end;

    /*ngrams1= ngrams1||'*';*/

    put ngrams1=;

    ngrams2 = "";

    do j = 1 to (m-1);
        ngrams2 = cats(ngrams2, substr(string2, j, 2) || '*');
    end;
endsub;

options cmplib=(work.functions);


data test;
  string1 = "joubrel";
  string2 = "farjoubrel";
  jaccard_distance = distance_jaccard(string1, string2);
run;

I expected ngrams1 and ngrams2 to contain all the substrings of length 2, but instead I got this:

ngrams1=jo*ou*ub
ngrams2=fa*ar*rj

sarah99
  • Why would a macro be slower than a function? All a macro does is generate code. And the code it generates does not have the overhead of an extra function call. – Tom Feb 09 '23 at 14:29
  • Explain the algorithm. What is the expected result for the example input you showed? What happens if either of strings contain repetitive snippets? `string1='aaaaabbbbbb'` – Tom Feb 09 '23 at 14:43
  • I need the function call because I'm reusing it in multiple tasks. What I'm trying to say is that I don't want to use data steps or other functions to reduce the output. – sarah99 Feb 09 '23 at 15:21
  • I'm recreating the jaccard index algorithm in order to compare two strings : https://www.statology.org/jaccard-similarity/ – sarah99 Feb 09 '23 at 15:24

1 Answer

If you want real help with your algorithm you need to explain in words what you want to do.

I suspect your problem is that you never defined how long your new character variables NGRAMS1 and NGRAMS2 should be. From the output you show, it appears that FCMP defaulted them to length $8.

To define a variable you need to use a LENGTH statement (or an ATTRIB statement with the LENGTH= option) before you start referencing the variable.
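
As a minimal sketch of that fix, the bigram-building loop could look something like the following. The helper name `bigrams` and the maximum length of 200 are assumptions made for illustration and are not part of the original code; pick a length large enough for your longest input (roughly three characters per bigram).

```sas
proc fcmp outlib=work.functions.func;
  /* Hypothetical helper: returns the '*'-delimited 2-character substrings. */
  /* The "$ 200" after the argument list is an assumed return length.       */
  function bigrams(string $) $ 200;
    /* Declare the length BEFORE the first reference to the variable. */
    length ngrams $ 200;
    n = length(string);
    ngrams = '';
    do i = 1 to n - 1;
      ngrams = cats(ngrams, substr(string, i, 2), '*');
    end;
    return (ngrams);
  endsub;
run;

options cmplib=work.functions;

data _null_;
  /* The receiving variables also need an explicit length. */
  length ngrams1 ngrams2 $ 200;
  ngrams1 = bigrams("joubrel");
  ngrams2 = bigrams("farjoubrel");
  put ngrams1= / ngrams2=;
run;
```

Note that CATS strips leading and trailing blanks from each argument, so a bigram that contains a space would be shortened. If your strings can contain blanks, a different delimiter-building approach may be needed.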

Tom
  • Thank you for your response. I'm trying to create a function that calculates the Jaccard index between two strings. The idea is to create two strings, ngrams1 and ngrams2, that contain all the substrings of length 2 of string1 and string2 respectively, and then to calculate the intersection and the union of ngrams1 and ngrams2. – sarah99 Feb 09 '23 at 15:18