
I have a long character variable (up to 12,000 characters), and I would like to find a word within it that sounds like a certain word.
I'd also like to create a flag variable that equals 1 if such a word is present. Let's say, for argument's sake, that the word I'm trying to find is "Mary" (not case-sensitive). Here are four sample values of a variable called "string" in a dataset called "question":

  • Mary had a little lamb its fleece was white as snow
  • Jack be nimble Jack be quick Jack jump over the candlestick
  • I think you and I should marry each other
  • I actually do not want to get married

The flag variable should equal 1 for strings 1 and 3 (because of "Mary" and "marry").

Unfortunately, I don't think I can use this code:

DATA answer;
   SET question;
   IF FINDW(string, SOUNDEX("Mary")) ne 0 THEN flag=1;
     ELSE flag=0;
RUN;

It doesn't work because SAS searches the string for the Soundex code of "Mary" itself, rather than for words that sound like "Mary". Any thoughts on how to get around this?

Sid

1 Answer


Here's one way. It loops through each word in the string and computes the Soundex code for that word. As soon as a word's code matches the code for "mary", it stops looping, for efficiency.

data test_set;
    infile datalines dsd;
    length string $100;
    input string;
    datalines;
Mary had a little lamb its fleece was white as snow
Jack be nimble Jack be quick Jack jump over the candlestick
I think you and I should marry each other
I actually do not want to get married
;
run;

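data test_set1(keep=string flag);
    set test_set;

    length i_word $100;

    /* Assume no match until one is found */
    flag = 0;

    /* Soundex code of the target word */
    mary_soundex = soundex("mary");

    /* Number of words in the current string */
    word_count = countw(string);

    i = 1;

    /* Stop at the end of the string or at the first match */
    do while (i le word_count and flag ne 1);
        i_word = scan(string, i);           /* i-th word */
        i_word_soundex = soundex(i_word);   /* its Soundex code */
        if mary_soundex eq i_word_soundex then flag = 1;
        i = i + 1;
    end;
run;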
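
The same idea can be written more compactly with an iterative DO loop and a LEAVE statement. This is only a sketch: it assumes the dataset "question" and the variable "string" from the question, and the output dataset name "answer" and the helper variable "target" are placeholders of mine. It is intended to give the same flag values as the loop above.

```sas
/* Sketch only: assumes the dataset "question" and its character
   variable "string" from the question. */
data answer(drop=i target);
    set question;

    /* Soundex code of the target word */
    target = soundex("Mary");

    flag = 0;
    do i = 1 to countw(string);
        /* Compare each word's Soundex code with the target's */
        if soundex(scan(string, i)) = target then do;
            flag = 1;
            leave;  /* stop at the first match */
        end;
    end;
run;
```

Because the comparison is made on Soundex codes rather than on the words themselves, "Mary" and "marry" are treated as the same word, while "married" is not.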

More on breaking sentences into words: http://blogs.sas.com/content/iml/2016/07/11/break-sentence-into-words-sas.html

Snorex