Convert your patterns to regular expression patterns by replacing *
with .
. If your original patterns are for an entire string add the regular expression anchoring special characters ^
starts-with and $
ends-with. Then in SQL existential sub-query you can determine the pattern matching state for each text.
Because regex matching is slow compared to other SAS text processing function you can do an inverse or inside-out determination (see second example).
Example:
data patterns;
length pattern $15;
input pattern;
datalines;
YI*2Z
YG*2B
YG*2G
C**2C
;
data texts;
length text $10;
input text;
datalines;
YIA2Z
YIA3Z
YIB2Z
YGA2B
YGA3B
YGA2A
YGZ2G
YGZZG
CAA2D
CAA2C
CAA2B
;
proc sql;
create table want as
select text
, exists (select * from patterns
where prxmatch('/^'||trim(translate(pattern,'.','*'))||'$/', trim(text))
)
as text_matches_a_pattern
from
texts
;
Example 2:
This should be much faster overall.
- Determine all distinct character maskings, there won't be too many, and place them in an array.
- Load patterns into a hash for fast lookup
- For each text apply the different maskings and see if any match a pattern.
data want2(keep=text pattern_matched_flag);
array twiddles [1000] $15 _temporary_;
not_star = compress(collate(0),'*');
do until (last_pattern);
set patterns end=last_pattern;
twiddle = translate(pattern,' ',not_star);
do i = 1 to dim(twiddles) while (not missing(twiddles(i)));
if twiddle = twiddles(i) then leave;
end;
if missing(twiddles(i)) then twiddles(i) = twiddle;
end;
declare hash pattern_lookup (dataset:'patterns');
pattern_lookup.defineKey('pattern');
pattern_lookup.defineDone();
do until (last_text);
set texts end=last_text;
found = 0;
do i = 1 to dim(twiddles) while (not missing(twiddles(i)) and not found);
twiddle = twiddles(i);
twiddled_text = text;
do j = 1 to length(twiddle);
if substr(twiddle,j,1) = '*' then substr(twiddled_text,j,1) = '*';
end;
found = pattern_lookup.find(key:twiddled_text) = 0;
end;
pattern_matched_flag = found;
output;
end;
stop;
run;