1

I have some data in the form of a column in a dataset (named Person_details), where each has an unknown number of names, with the name (split up by spaces), followed by an underscore, followed by that persons identifier (7 characters).

Is there a way to split these entries up automatically, rather than repeatedly finding the position of the underscore, and then taking the substring before and after?

Person_details:

Evan Davies_123F323 Adam John Smith_342D427 Karl Marx_903C943

There are an unknown number of names in each cell, e.g. some have just one name and some have 20. Also complicated by the fact that some entries have middle name(s).

The ideal output would be in the form

Name Code

Evan Davies 123F323

Adam John Smith 342D427

Karl Marx 903C943

  • 1
    What do you mean by "cell". Is the data already in a dataset? If so what is the name of the variable with current string. How many observations are there? Or are you planning to read this from a text file? If so what is the format of the text file? Are there any other fields. What is the output you want from that one example value you show? – Tom Aug 17 '22 at 15:05

2 Answers2

1

You could just use SCAN() instead.

data have;
  string='Evan Davies_123F323 Adam Smith_342D427 Karl Marx_903C943';
  length name $50 code $7 ;
  do index=1 to countw(string,' ');
    name = catx(' ',name,scan(string,index,' '));
    if index(name,'_') then do;
      code = scan(name,-1,'_');
      name = substr(name,1,length(name)-length(code)-1);
      output;
      name=' ';
    end;
  end;
run;

Result

enter image description here

Tom
  • 47,574
  • 2
  • 16
  • 29
1

You can use a Perl regular expression (regex) to detect and extract pieces from patterned text. SAS routine PRXNEXT iterates through matches, and function PRXPOSN extracts pieces.

Example:

data have;
  text = 'Evan Davies_123F323 Adam John Smith_342D427 Karl Marx_903C943';
run;

data want(keep=name code);
  rx = prxparse('/(.+?)_(.{7}( |$))/');

  set have;

  start = 1;
  stop = length(text);
  do seq = 1 by 1;
    call prxnext(rx,start,stop,text,position,length);
    if position=0 then leave;
    name = prxposn(rx,1,text);
    code = prxposn(rx,2,text);
    output;
  end;
run;

enter image description here

Richard
  • 25,390
  • 3
  • 25
  • 38