0

I need to extract the string after the numbers. Although the problem is that the number of digits at the front of the string is inconsistent. What I need is something similar like the Flash Fill in Excel. But I'll be doing it for 100K+ rows so Excel might not be able to handle the data. For example:

12345678aaa@mail.com

12345bbb@mail.com

123456789ccc@mail.com

I want the create another variable with the extracted string such as the following:

aaa@mail.com

bbb@mail.com

ccc@mail.com

Is this possible?

Thank you in advance!

user3360708
  • 35
  • 1
  • 4
  • 9

2 Answers2

0

You can use regular expression substitution (PRXCHANGE), or a careful use of the VERIFY function.

Example:

data have;
input email $char25.; datalines;
12345678aaa@mail.com
12345bbb@mail.com
123456789ccc@mail.com
1234567890123456789012345
;

data want;
  set have;

  mail1 = prxchange('s/^\d+//',-1,email);

  if email in: ('0','1','2','3','4','5','6','7','8','9') then 
    mail2 = substr(email||' ',verify (email||' ', '0123456789'));
run;
Richard
  • 25,390
  • 3
  • 25
  • 38
  • Hi Richard, thanks for the answer! Will this also be able to capture if the email contains numbers as well? Like for example the first data is 1234aaa01@mail.com and the desired output is aaa01@mail.com? Thanks! – user3360708 Sep 23 '20 at 04:10
  • You can try it yourself, but yes, both ways will strip only leading digits. Small caution: If the email value is not validated there may be some rare cases where there are leading whitespace characters (\r, \t, space, hardspace) before the digits and you might judge the transformation as 'not working' – Richard Sep 23 '20 at 10:46
0

Example above should be OK,

but assuming that some email addresses could have numbers, 123abc001@mail.com for instance, my code below should help:

data have;
input email $char25.; datalines;
12345678abc01@mail.com
12345bcde@mail.com
123456789cdefg1@mail.com;


PROC SQL;
   CREATE TABLE WORK.QUERY_FOR_HAVE_0003 AS 
   SELECT t1.email, 
          /* want */
            (substrn(t1.email,INDEXC( t1.email, SUBSTRN(COMPRESS(t1.email, 'abcdefghijklmnopqrstuvwxyz', 'k'), 1, 1)))) 
            AS want
      FROM WORK.HAVE t1;
QUIT;

Firstly, we use COMPRESS functionm to leave only char values;
Then SUBSTRN - to have the first character appearing in email address;
After than INDEXC - returns position of the character;
Finally SUBSTRN again - leaves the rest of the email, starting from the position provided from step before.

final look: [1]: https://i.stack.imgur.com/hFftg.png

Niqua
  • 386
  • 2
  • 15
  • p.s. assuming that your email addresses are formatted in lowercase. if not, you may need to change 'abcdefg..xyz' in the `COMPRESS` function to match your case – Niqua Sep 25 '20 at 06:09