-1

I am trying to find the various combinations only in the cases where the records in the below dataset only differ by the column type. So for example: the first three rows only differ by the column type

Given Dataset

ins_id    ins_number   type
1234      1234-1234-1  AU
1234      1234-1234-1  HM
1234      1234-1234-1  RE
567       567-567-12   TL
567       567-567-13   TL
9101      9101-1234-1  AU
9101      9101-1234-1  TX
9101      9101-1234-1  CN
8854      8854-1234-1  TX
8854      8854-1234-1  GB
8854      8854-1234-1  RE
8854      8854-1234-2  RX

Expected Output:
combination  count
AU,HM,RE     1
AU,TX,CN     1
TX,GB,RE     1

I tried writing the query but I am not getting the desired output, please help:

proc sql;create table tst as select cp.type, 
       count(distinct ins_id)
from (select distinct fac_prod_typ from dataset3a) cp cross join
     (select distinct ins_number from dataset3a) pes left join
     dataset3a
     on dataset3a.type = cp.type and
        dataset3a.ins_number = pes.ins_number
group by cp.type, pes.ins_number;quit;
Chug
  • 31
  • 6
  • A data step would be much easier here, using BY group processing. Is that an option here? – Reeza Apr 12 '21 at 20:01
  • Yes, anything as long as it's the correct output. – Chug Apr 12 '21 at 20:08
  • 1
    If among the different types within a group there is a repeated type do you want to how many times the type occurred ? (for example an id with 6 records, 3 types and some repeats could be summarized as `AU,HM(3),RE(2)`) – Richard Apr 12 '21 at 20:39

2 Answers2

2

You will want to sort the data to ensure the types list is consistent over all ids. A DOW loop over a SET...; BY...; will output one types list per group. The last step is using Proc FREQ to count the number of ids for each types list.

Example:

data have;
informat ins_id $8. ins_number $25. type $2.;
input ins_id  $  ins_number $  type $;
cards;
1234      1234-1234-1  AU
1234      1234-1234-1  HM
1234      1234-1234-1  RE
567       567-567-12   TL
567       567-567-13   TL
9101      9101-1234-1  AU
9101      9101-1234-1  TX
9101      9101-1234-1  CN
8854      8854-1234-1  TX
8854      8854-1234-1  GB
8854      8854-1234-1  RE
8854      8854-1234-2  RX
;

/* force specific ordering of type within group id and number */
/* necessary for proper frequency counting */
/* if sequence of types IS important do not sort and data step by ... NOTSORTED */

proc sort data=have;
  by ins_id ins_number type;
run;

data types(keep=types);
  length types $200;
  do until (last.ins_number);
    set have;
    by ins_id ins_number;
    if indexw(types, type) = 0 then types = catx(',',types,type);
  end;
  if index(types,',') then output;
run;

proc freq noprint data=types;
  table types / out=types_counts(keep=types count) ;
run;

enter image description here

Richard
  • 25,390
  • 3
  • 25
  • 38
  • Hi Richard, it doesn't work if we add one more row as I have done to my original post. That test case fails. So basically, for 8854, there should be TX, GB, RE and not RX – Chug Apr 12 '21 at 21:11
  • 1
    I see. The `BY` group needs to be the two variables `INS_ID INS_NUMBER`. Will fix. – Richard Apr 12 '21 at 21:49
  • Yup, that worked. You are awesome, thanks – Chug Apr 13 '21 at 16:54
1

Use FIRST/LAST logic is nice here. To get the counts, run a PROC FREQ on the final output and this would also allow you to identify the ins_id for the mixes as well.

data have;
informat ins_id $8. ins_number $25. type $2.;
input ins_id  $  ins_number $  type $;
cards;
1234      1234-1234-1  AU
1234      1234-1234-1  HM
1234      1234-1234-1  RE
567       567-567-12   TL
567       567-567-13   TL
9101      9101-1234-1  AU
9101      9101-1234-1  TX
9101      9101-1234-1  CN
8854      8854-1234-1  TX
8854      8854-1234-1  GB
8854      8854-1234-1  RE
;;;;

data want;
set have;
by ins_id ins_number type notsorted;
retain combo;
length combo $256.;
if first.ins_number then call missing(combo);

if first.type then combo = catx(", ", combo, type);

if last.ins_number and countw(combo)>1 then output;

run;
Reeza
  • 20,510
  • 4
  • 21
  • 38
  • Hi Reeza, it doesn't work if we add one more row as I have done to my original post. That test case fails. So basically, for 8854, there should be TX, GB, RE and not RX – Chug Apr 12 '21 at 21:12
  • So you only want the first three? By order or is RX excluded for some other reason? – Reeza Apr 12 '21 at 21:58
  • Assuming INS_NUMBER is the key, just change your FIRST/LAST to refer to the last variable that uniquely identifies a grouping. – Reeza Apr 12 '21 at 22:00