String cleanup and substring replacement

Question

I have a table with 100k + rows.

I'm trying to remove all instance of several sub-strings from one of the fields.

The method I found so far basically calls tranwrd for every offending sub-string.

There are three in the example below, but in the actual data sets there are quite a few more.

data mytable;
infile datalines delimiter=':' truncover;
informat myfield $50. someval 3.;
input myfield someval;
datalines;
some value xx abc:10
another values:15
random stuff ccc:1
more stuff xx:2
example abc:44
foo abc bar:55
sub xx string:11
;
run;

proc sql;
    update mytable set myfield = strip(tranwrd(myfield,'abc',''));

    update mytable set myfield = strip(tranwrd(myfield,'ccc',''));

    update mytable set myfield = strip(tranwrd(myfield,'xx',''));
quit;

Is there was to get the same done with a single statement ?

I.e. given a full list of strings to delete, delete them all.

something along the lines of:

update mytable
set myfield = somefunction(myfield,/'abc','ccc','xx'/,'')

Thank you

With some more poking around, came up with below:

data mytable2;
set mytable;
n_myfield = myfield;
length word $50;
do word = 'abc','ccc','xx';
    n_myfield = tranwrd(n_myfield,word,'');
end;
    n_myfield = compbl(n_myfield);
drop word;
run;

does the trick without nesting (I really don't want to nest 10-15 tranwrd calls) or multiple nearly identical update statements.

regex is something I was hoping to use

You can use data step: `data mytable; set mytable; myfield = tranwrd(myfield,'abc',''); myfield = tranwrd(myfield,'ccc',''); myfield = tranwrd(myfield,'xx',''); myfield = strip(myfield); run;` — Petr, Jan 23 '18 at 17:56
There's also regular expressions ([`prxchange`](http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p1vz3ljudbd756n19502acxazevk.htm)), but that requires converting your patterns to valid regular expressions. — Nickolay, Jan 23 '18 at 17:58
The [compress](http://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n0fcshr0ir3h73n1b845c4aq58hz.htm&docsetVersion=9.4&locale=en#p1v3kt0k9xvlwjn1gvwne2mwuk6f) function is also better for removing characters instead of replacing them. — J_Lard, Jan 23 '18 at 18:04

Dmitry.Kolosov · Accepted Answer · 2018-01-25T21:04:05.483

This can be done with regular expressions by concatenating the strings using the pipe character '|' which means "OR" in regular expressions:

myField = prxChange("s/abc|ccc|xx//",-1,trim(myField));

It is especially handy if you are trying to remove whole words, which is not easy with tranwrd, in this case you just change the regex to:

myField = prxChange("s/\b(abc|ccc|xx)\b\s?//",-1,trim(myField));

Where \b means word border (anything that is not a letter/number/underscore) and \s? part takes care of a possible extra space after the removed word. It will change

"abcd abc ccc xx abccccxx"

to

"abcd abccccxx"

But as noted by Nickolay it works fine, until you have metacharacters in your string: {}^$@.|*+?\ and the character used to mark borders of regex, in this example it is / (you can change s/.../.../ to s#..#...# or s$..$..$ if you like). When you have them, you either manually escape them with \, e.g., the line below removes strings "abc", "$c\c", "x.":

myField = prxChange("s/abc|\$c\\c|xx//",-1,trim(myField));

Or run apply additional regex, which will escape all special characters:

length wordsToRemove $200;
retain wordsToRemove 'abc|ccc|xx|$pec@lW()rd';

if _n_ eq 1 then do;
    * This does not change, so set it once;
    wordsToRemove=prxChange('s/([\Q{}[]()^.*+?\E\$\@\\\/])/\\$1/', -1,strip(wordsToRemove));
end;
myField=prxChange('s/'|| strip(wordsToRemove) || '//', -1, trim(myField));

score 1 · Answer 2 · answered Jan 24 '18 at 03:11

You could just read each word and evaluate whether to keep it or discard it.

data mytable;
  input someval myfield $50. ;
datalines;
10 some value xx abc
15 another values
1 random stuff ccc
2 more stuff xx
44 example abc
55 foo abc bar
11 sub xx string
;

data want ;
  set mytable ;
  length i 8 word n_myfield $50 ;
  drop i word ;
  do i =1 to countw(myfield,' ');
    word=scan(myfield,i,' ');
    if not findw('abc ccc xx',trim(word),' ') then n_myfield=catx(' ',n_myfield,word);
  end;
run;

Result:

Obs    someval    myfield              n_myfield

 1        10      some value xx abc    some value
 2        15      another values       another values
 3         1      random stuff ccc     random stuff
 4         2      more stuff xx        more stuff
 5        44      example abc          example
 6        55      foo abc bar          foo bar
 7        11      sub xx string        sub string

score 0 · Answer 3 · answered Jan 23 '18 at 20:26

You can do more or less the same thing via regex:

data mytable;
    modify mytable;
    myfield = strip(prxchange('s/abc|ccc|xx//',-1,myfield)));
    replace;
run;

For reasons that remain unclear to me, this leaves you with 2 spaces when a matching word in between two other words is removed, whereas your original code leaves three. However, I suspect that this probably doesn't matter for your purposes.

score 0 · Answer 4 · answered Jan 24 '18 at 01:17

regular expressions is a simple and efficient approach, do loop is other approach as you mentioned, but with array.

data mytable2;
set mytable;
n_myfield = myfield;
array var (3) $10  _temporary_ ('abc','ccc','xx');
do i=1 to 3;
    n_myfield = tranwrd(n_myfield,strip(var(i)),'');
end;
n_myfield = compbl(n_myfield);
drop i;
run;

score 0 · Answer 5 · answered Jan 24 '18 at 11:28

A single pass over the set of removal targets is not necessarily sufficient.
A robust reduction would require multiple passes until a full traversal caused no replacements.

Here is an example Proc DS2 step that defines a reusable cleaner method and uses it in a data program:

data have;
  infile datalines delimiter=':' truncover;
  informat myfield $50. someval 3.;
  input myfield someval;
datalines;
some value xx abc:10
another values:15
random stuff ccc:1
more stuff xx:2
example abc:44
foo abc bar:55
sub xx string:11
cabxxccc what to do?:123
xacccbcx funky chicken:456
;
run;

proc DS2 libs=WORK;

  package cleaner / overwrite=yes;

    method _remove ( 
        varchar(200) haystack
      , varchar(200) needles[*]
    )
    returns varchar(200);

      declare int i L P ;
      declare int removal_count;
      declare varchar(200) needle;

      do while (length ( haystack ) > 0);
        removal_count = 0;
        do i = 1 to dim(needles);

          needle = needles[i];
          L = lengthc (needle);
          P = index (haystack, needle);

          if L > 0 and P > 0 then do;
            haystack = tranwrd(haystack,needle,'');
            removal_count + 1;
          end;
        end;

        if removal_count = 0 then leave;
      end;

      return haystack;
    end;
  endpackage;

  data want / overwrite=yes;
    declare package cleaner c();
    declare varchar(20) targets[3];
    method init ();
      targets := ('abc', 'ccc', 'xx');
      put 'NOTE: INIT:' targets[*]=;
    end;
    method run ();
      set have;
      put myfield= ;
      myfield = c._remove(myfield, targets);
      put myfield= ;
      put;
    end;
  run;
quit;

Log

NOTE: INIT: targets[1]=abc targets[2]=ccc targets[3]=xx
myfield=some value xx abc
myfield=some value

myfield=another values
myfield=another values

myfield=random stuff ccc
myfield=random stuff

myfield=more stuff xx
myfield=more stuff

myfield=example abc
myfield=example

myfield=foo abc bar
myfield=foo  bar

myfield=sub xx string
myfield=sub  string

myfield=cabxxccc what to do?
myfield=cab what to do?

myfield=xacccbcx funky chicken
myfield= funky chicken

String cleanup and substring replacement

5 Answers5