Consider a table structure that has k
key variables and n
data variables.
The key variables define the BY
group and from all the rows within the group you want to select the fewest rows whose non-missing data values completely cover the other rows with the same but fewer non-missing data values.
Consider a single row, with 5 data variables, of mixed types, abstractly named for this analysis:
C1, C2, N3, C4, C5
C
for character,
N
for numeric.
Note: All the data variables can not be referenced through a single DATA step array
because of the different types.
Moving on:
- Construct a bit mask corresponding to the non-missing values. It will have
m
bits turned on.
- Identify sub-masks
- Those bit masks whose coverage is inferior to the bit mask
- There will
2
m
-1
sub-masks
Consider an example row:
The sample data values have a mask with m=3
There are 2
3
-1 = 7
sub masks
1 1 0 0 *
1 * 0 0 1
1 * 0 0 *
* 1 0 0 1
* 1 0 0 *
* * 0 0 1
* * 0 0 *
For any other row in the group with the same values (or nulls) the corresponding mask will be a sub-mask of the sample row and thus 'inferior' in coverage and thus can be discarded.
A hash, with a key that is all the data variables, can be used to track principal rows with some mask, and the computed sub-masks. If a subsequent row has a sub-mask corresponding to the principal mask of a prior row, the prior row is marked as inferior and thus can be filtered from selection.
A loop over the 2n values from 0
to 2
n
-1
is a simple way iterate through all candidate masks. BAND
operations against the principal mask will compute a sub-mask from the candidate.
Sample Code
data have;input
ID FIRST:$ LAST:$ YEAR CITY:$ COUNTRY:$; datalines;
1 John SMITH 1985 NewYork USA
1 John . 1985 . USA
1 John . 1985 . UK
1 . SMITH . Miami USA
1 John SMITH 1985 NewYork USA
1 Mark SMITH 1990 London UK
1 Mark SMITH 1990 London UK
1 Mark SMITH 1990 London UK
1 Mark SMITH 1990 London UK
1 . SMITH 1990 London UK
1 Mark . 1990 London UK
1 Mark SMITH . London UK
1 Mark SMITH 1990 . UK
1 Mark SMITH 1990 London .
1 Mark SMITH . London UK
1 Mark . . London UK
;
data have;
set have have(in=_2);
if _2 then id=2;
run;
%macro loadkeysFor(var);
%local i j n itop bits;
%let n = %sysfunc(countw(&var));
%let itop = %eval(&n-1);
%do i = 0 %to &itop;
%local var&i;
%let var&i = %scan(&var,&i+1); %* data variable names;
%end;
%do i = 0 %to &itop;
_&i = &&var&i; %* generate code to save data values;
%end;
%do i = 0 %to &itop;
_bit&i = not missing(_&i); %* generate code to compute bits of mask;
%end;
%* mask indicates the non-missing permutations of 1 bits in mask submasks will be ;
_mask = input(cats(of _bit&itop-_bit0), binary&n..);
/* put (_&itop-_0) (=);*/
/* put (_bit&itop-_bit0) (1.) +1 _mask binary&n..;*/
if h.find() = 0 then continue;
%* continue will skip this row because when 'found' the data values of the row are either
%* - identical to a prior row, or
%* - inferior to a prior row
%*;
_submask = _mask;
_seq = index; %* non-missing seq is the mark of a principal row;
h.add(); %* new principal row. save it, and replace all keys of corresponding key values as inferior;
array _mark(0:%eval(2**&n)) _temporary_;
call missing (of _mark(*));
_seq = .; %* seq is the mark of an inferior submask;
do _maskbits_ = 0 to %eval(2**&n-1);
_submask = band (_mask, _maskbits_); %* good ole BAND - binary and, compute the submask;
if _submask = _mask then continue; %* skip principal row;
if missing(_mark(_submask)) then do; %* reduce extra work, each submask done only once;
_mark(_submask) = 1;
%* generate code to assign data values (from saved values) according to submask;
%* set host variables according to submask;
%do i = 0 %to &itop;
if band(_submask,blshift(1,&i)) then &&var&i = _&i; else call missing(&&var&i);
%end;
rc = h.replace(); %* add/replace hash entry of inferior sub-mask;
end;
end;
format _submask _mask binary&n.. _seq _n_ 4.;
%mend;
options mprint;
data want;
if 0 then set have;
if _n_ = 1 then do;
declare hash h (ordered:'a');
h.defineKey ('first', 'last', 'year', 'city', 'country');
h.defineData('first', 'last', 'year', 'city', 'country', '_seq'); * , 'row', '_submask', '_mask';
h.defineDone();
declare hiter hi('h');
declare hash select();
select.defineKey('_seq');
select.defineDone();
end;
h.clear();
do index = 1 by 1 until (last.id);
set have;
by id;
row = index;
%loadkeysFor(first last year city country)
end;
put index=;
indexTop = index;
%* retrieve maximally covering principal rows of group;
select.clear();
do _n_ = 1 by 1 while (hi.next() = 0);
if _seq then OUTPUT;
end;
keep id first last year city country;
run;