0

I have a long list of patterns (called "List A") as partially shown at the attached image. "*" refers to a single character/string value.

On the other hand, I have a long list of character/string data (e.g. YIB2Z, YGEHN....), called "List B".

May I know how I can identify if the data in List B matches any of the patterns listed in List A? I am using SPSS modeler or SAS Enterprise Guide to process the data.

It would be highly appreciated if any advice could be given. Thanks in advance.

Patter List Pattern List (List A)

StephL
  • 27
  • 4

2 Answers2

1

Convert your patterns to regular expression patterns by replacing * with .. If your original patterns are for an entire string add the regular expression anchoring special characters ^ starts-with and $ ends-with. Then in SQL existential sub-query you can determine the pattern matching state for each text.

Because regex matching is slow compared to other SAS text processing function you can do an inverse or inside-out determination (see second example).

Example:

data patterns;
  length pattern $15;
  input pattern;
datalines;
YI*2Z
YG*2B
YG*2G
C**2C
;

data texts;
  length text $10;
  input text;
datalines;
YIA2Z
YIA3Z
YIB2Z
YGA2B
YGA3B
YGA2A
YGZ2G
YGZZG
CAA2D
CAA2C
CAA2B
;

proc sql;
  create table want as
  select text
  , exists (select * from patterns 
            where prxmatch('/^'||trim(translate(pattern,'.','*'))||'$/', trim(text))
           )
    as text_matches_a_pattern
  from 
    texts
  ;

Example 2:

This should be much faster overall.

  • Determine all distinct character maskings, there won't be too many, and place them in an array.
  • Load patterns into a hash for fast lookup
  • For each text apply the different maskings and see if any match a pattern.
data want2(keep=text pattern_matched_flag);
  array twiddles [1000] $15 _temporary_;

  not_star = compress(collate(0),'*');

  do until (last_pattern);
    set patterns end=last_pattern;
    twiddle = translate(pattern,' ',not_star);
    do i = 1 to dim(twiddles) while (not missing(twiddles(i)));
      if twiddle = twiddles(i) then leave;
    end;
    if missing(twiddles(i)) then twiddles(i) = twiddle;
  end;

  declare hash pattern_lookup (dataset:'patterns');
  pattern_lookup.defineKey('pattern');
  pattern_lookup.defineDone();

  do until (last_text);
    set texts end=last_text;

    found = 0;
    do i = 1 to dim(twiddles) while (not missing(twiddles(i)) and not found);
      twiddle = twiddles(i);
      twiddled_text = text;

      do j = 1 to length(twiddle);
        if substr(twiddle,j,1) = '*' then substr(twiddled_text,j,1) = '*';
      end;

      found = pattern_lookup.find(key:twiddled_text) = 0;
    end;

    pattern_matched_flag = found;

    output;
  end;

  stop;
run;
Richard
  • 25,390
  • 3
  • 25
  • 38
0

A possible solution using SAS would be to store the patterns in a temporary array then check each string against every element of this array.

* create the dataset with patterns;
data listA;
    input pattern $;
    datalines;
YI*2Z
YG*2B
YG*2G
YG*2H
YG*2K
YG*2L
YG*2M
YG*2N
YG*2O
YG*2Q
YG*2S
YG*2V
YG*2Y
C**2G
;
run;

* create some sample data to test;
data listB;
    input string $;
    datalines;
YIR2Z
YN435
C112G
C113G
YG42L
YG52L
YGS2I
;
run;

* The following 3 lines count the number of patterns in listA and save the number in macro variable &num_patterns;
%let dsid=%sysfunc(open(listA,is));
%let num_patterns = %sysfunc(attrn(&dsid.,NOBS));
%let rc=%sysfunc(close(&dsid.));

data want;
    * create a temporary array of parsed regex patterns to be used for matching;
    array patterns {&num_patterns} _temporary_;

    * read all patterns in listA, parse them using PRXPARSE and save them in the array PATTERNS;
    if _n_ = 1 then do pattern_number=1 by 1 until (lastobs);
        set listA end=lastobs;
        pattern=translate(pattern,'.','*'); * <== replace the * with . which is the metacharacter for any single character;
        patterns(pattern_number)=prxparse('/'||trim(pattern)||'/');
    end;

    * now read the strings from listB;
    set listB;
    match=0;

    * if the string matches any of the patterns then set MATCH = 1;
    do i =1 to &num_patterns until (match=1);
        if prxmatch(patterns(i),string) then match=1;
    end;

    drop i pattern_number pattern;
run;




shaun_m
  • 1,213
  • 3
  • 13