2

have is a sas data set with 4 variables: an id and variables storing info on all the activities a respondent shares with 3 different members of a team they're on. There are 4 different activity types, identified by the numbers populating in the :_activities vars for each player (p1 to p3). Below are the first 5 obs:

id  p1_activities   p2_activities   p3_activities
A   1,2,3,4         1,3 
B   1,3             1,2,3           1,2,3
C                   1,2,3           1,2,3
D                   1,2,3   
E   1,2,3                           1

Consider respondent A: they share all 4 activities with player 1 on their team, and activities 1 and 3 with player 2 on their team. I need to create flags for each player position and each activity. For example, a new numeric variable p1_act2_flag should equal 1 for all respondents who have a value of 2 appearing in the p1_activities character variable. Here are the first 6 variables I need to create out of the 12 total for the data shown:

p1_act1_flag p1_act2_flag p1_act3_flag p1_act4_flag p2_act1_flag p2_act2_flag …
1            1            1            1            1            0            …
1            0            1            0            1            1            …
.            .            .            .            1            1            …
.            .            .            .            1            1            …
1            1            1            0            .            .            …

I do this now by initializing all of the variable names in a length statement, then writing a ton if-then statements. I want to use far fewer lines of code, but my array logic is incorrect. Here's how I try to create the flags for player 1:

data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
       p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
       p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
       p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;

array plracts  {*} p1_activities p2_activities p3_activities;

array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;

do i=1 to dim(plracts);
do j=1 to dim(p1actflg);

         if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
    else if missing(plracts{i})                  then p1actflg{j}=.;
    else                                              p1actflg{j}=0;

end;
end;

*do this again for the other p#actflg arrays;

run;

My "array subscript is out of range" because of the different lengths of the player and activity arrays, but nesting in different do-loops would result in me writing many more lines of code than a wallpaper solution.

How would you do this more systematically, and/or in far fewer lines of code?

J.Q
  • 971
  • 1
  • 14
  • 29

3 Answers3

3

Not sure why you are processing 4 activities for flags when there are only 3.

Some ideas:

  • Refactoring the column names to numbered suffixes would reduce some of the wallpaper effect.
    • activities_p1-activities_p3
  • Refactoring the flag column names to number suffixes
    • flag_p1_1-flag_p1_4
    • flag_p2_1-flag_p2_4
    • flag_p3_1-flag_p3_4
  • Use DIM to stay within array bounds.
  • Use two dimensional array for flags
  • Use direct addressing of items to be flagged
  • Add error checking

Not fewer, but perhaps more robust ?

This code examines each item in the activities list as opposed to seeking presence of a specific items (1..4):

data want;
  set have;
  array activities
    activities_p1-activities_p3
  ;
  array flags(3,4) 
    flag_p1_1-flag_p1_4
    flag_p2_1-flag_p2_4
    flag_p3_1-flag_p3_4
  ;

  do i = 1 to dim(activites);
    if missing(activities[i]) then continue; %* skip;
    do j = 1 by 1;
      item = scan ( activities[i], j, ',' );
      if missing(item) then leave; %* no more items in csv list;
      item_num = input (item,?1.);
      if missing(item_num) then continue; %* skip, csv item is not a number;
      if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
        put 'WARNING:' item_num 'is invalid for flagging';
        continue; %* skip, csv item is missing, 0, negative or exceeds 4;
      end;
      flags (i, item_num) = 1;
    end;
    * backfill zeroes where flag not assigned;
    do j = 1 to hbound(flags,2);
      flags (i, item_num) = sum (0, flags (i, item_num));  %* sum() handles missing values;
    end;
  end;

Here is the same processing, but only searching for specific items to be flagged:

data have; length id activities_p1-activities_p3 $20;input 
id  activities_p1-activities_p3 ; datalines;
A   1,2,3,4         1,3             .
B   1,3             1,2,3           1,2,3
C   .               1,2,3           1,2,3
D   .               1,2,3           .
E   1,2,3           .               1
;
data want;
  set have;
  array activities
    activities_p1-activities_p3
  ;
  array flags(3,4) 
    flag_p1_1-flag_p1_4
    flag_p2_1-flag_p2_4
    flag_p3_1-flag_p3_4
  ;
  do i = 1 to dim(activities);
    if not missing(activities[i]) then
    do j = 1 to hbound(flags,2);
      flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
    end;
  end;
run;

What's going on ?

  • flags variables are reset to missing at top of step
  • hbound return 4 as upper limit of second dimension
  • findw(trim(activities[i]),cats(j),',') find position of j in csv string
    • trim needed to remove trailing spaces which are not part of findw word delimiter list
    • cats converts j number to character representation
    • findw returns position of j in csv string.
      • might want to also compress out spaces and other junk if activity data values are not reliable.
    • first > 0 evaluates position to 0 j not present and 1 present
    • second > 0 is a another logic evaluation that ensures j present flag remains 0 or 1. Otherwise flags would be a frequency count (imagine activity data 1,1,2,3)
  • flags(i,j) covers the 3 x 4 slots available for flagging.
Richard
  • 25,390
  • 3
  • 25
  • 38
  • I'll check this out. To be clear, there *are* four activities. The numbers in the original variables of `have` represent unique activities. When I said "Consider respondent A: they share all 4 activities with player 1 on their team" - that meant "1,2,3,4" are the four types of activities. – J.Q Jan 29 '20 at 18:29
1

Consider converting into a hierarchical view and doing the logic there. The real stickler here is the fact that there can be missing positions within each list. Because of this, a simple do loop will be difficult. A faster way would be multi-step:

  1. Create a template of all possible players and positions
  2. Create an actual list of all players and positions
  3. Merge the template with the actual list and flag all matches

It's not as elegant as a single data step like could be done, but it is somewhat easy to work with.

data have;
infile datalines dlm='|';
input id$ p1_activities$ p2_activities$ p3_activities$;
datalines;
A|1,2,3,4|1,3| 
B|1,3|1,2,3|1,2,3| 
C| |1,2,3|1,2,3| 
D| |1,2,3| 
E|1,2,3| |1
;
run;

/* Make a template of all possible players and positions */
data template;
    set have;
    array players p1_activities--p3_activities;
    length varname $15.;

    do player = 1 to dim(players);
        do activity = 1 to 4;

            /* Generate a variable name for later */
            varname = cats('p', player, '_act', activity, '_flg');
            output;
        end;
    end;

    keep ID player activity varname;
run;

/* Create a list of actual players and their positions */
data actual;
    set have;
    array players p1_activities--p3_activities;

    do player = 1 to dim(players);
        do i = 1 to countw(players[player], ',');
            activity = input(scan(players[player], i, ','), 8.);

            /* Do not output missing positions */
            if(NOT missing(activity)) then output;
        end;
    end;

    keep ID player activity;
run;

/* Merge the template with actual values and create a flag when an
   an id, player, and activity matches with the template
*/
data want_long;
    merge template(in=all)
          actual(in=act);
    by id player activity;

    flag_activity = (all=act);
run;

/* Transpose it back to wide */
proc transpose data=want_long
               out=want_wide;
    id varname;
    by id;
    var flag_activity;
run;
Stu Sztukowski
  • 10,597
  • 1
  • 12
  • 21
1

Following Stu's example, a DS2 DATA step can perform his 'merge' using a hash lookup. The hash lookup depends on creating a data set that maps CSV item lists to flags.

* Create data for hash;

data share_flags(where=(not missing(key)));
  length key $7 f1-f4 8;
  array k[4] $1 _temporary_;

  do f1 = 0 to 1; k[1] = ifc(f1,'1','');
  do f2 = 0 to 1; k[2] = ifc(f2,'2','');
  do f3 = 0 to 1; k[3] = ifc(f3,'3','');
  do f4 = 0 to 1; k[4] = ifc(f4,'4','');
    key = catx(',', of k[*]);
    output;
  end;end;end;end;
run;

proc ds2;
  data want2 / overwrite=yes;
    declare char(20) id;
    vararray char(7) pact[*] activities_p1-activities_p3;
    vararray double fp1[*] flag_p1_1-flag_p1_4;
    vararray double fp2[*] flag_p2_1-flag_p2_4;
    vararray double fp3[*] flag_p3_1-flag_p3_4;
    declare char(1) sentinel;

    keep id--sentinel;
    drop sentinel;

    declare char(7) key;
    vararray double flags[*] f1-f4;

    declare package hash shares([key],[f1-f4],4,'share_flags'); %* load lookup data;

    method run();
      declare int rc;
      set have;

      rc = shares.find([activities_p1],[flag_p1:]);  %* find() will fill-in the flag variables;
      rc = shares.find([activities_p2],[flag_p2:]);
      rc = shares.find([activities_p3],[flag_p3:]);
    end;

  enddata;
run;
quit;
%let syslast = want2;

share_flags
enter image description here

result
enter image description here

Richard
  • 25,390
  • 3
  • 25
  • 38