Removing duplicates and almost duplicates (missing values) in SQL

Question

I need a way to remove duplicate rows that also removes rows that are equivalent but for some missing values. I.E. I have

ID   FIRST LAST  YEAR CITY    COUNTRY
1    John  SMITH 1985 NewYork USA
1    John  NULL  1985 NULL    USA
1    NULL  SMITH NULL Miami USA
1    John  SMITH 1985 NewYork USA

I require:

1    John  SMITH 1985 NewYork USA
1    NULL  SMITH NULL Miami   USA

It is similar to this question:

SQL Remove almost duplicate rows

However, in my data any of the variables (except for ID) can be NULL. If it makes a difference I am doing this in SAS, so proc SQL or a SAS datastep would be fine.

So a couple things I think will work but are just taking forever to process (my file is on the order of 20 million rows):

proc sql create table want as select

t1.index,
t2.index as index2

from data t1, data t2 where 

(t1.first = t2.first or missing(t1.first) or missing(t2.first)) AND

(t1.last = t2.last or missing(t1.last) or missing(t2.last)) AND

(t1.year = t2.year or missing(t1.year) or missing(t2.year)) AND

(t1.city = t2.city or missing(t1.city) or missing(t2.city)) AND

(t1.country = t2.country or missing(t1.country) or missing(t2.country)) AND

(t1.ID = t2.ID);

quit;

From there I can recognize which rows are equivalent and can remove them. But proc sql is terribly slow with OR clauses in WHERE statements.

The other solution I had was using Michael Goldshteyn's solution from the linked question and just applying it to all variables sequentially, but I stopped it after 2 days of processing without finishing.

Any way to speed either of these up?

What columns do you need to be different? you could use a **proc sort** statement with the _nodupkey_ option. — Aaron, Mar 23 '20 at 21:20
If there are differences in any variables then I need that row to be kept. The issue is that it recognizes Null and say Smith as being different and I need it to think they are not — astel, Mar 23 '20 at 21:55
In other words, you're looking for some sort of fuzzy matching algorithm that will tell you this row is close enough to a prior row, or another row within that ID group by using the values within all columns. It's a pretty big job to build yourself but is a great tool. Take a look at this paper here on probabilistic record linkage in SAS. https://www.google.com/url?sa=t&source=web&rct=j&url=https://www.lexjansen.com/wuss/2011/data/Papers_Wright_G_76128.pdf&ved=2ahUKEwiUt57Tl7LoAhWlmOAKHUH2BBEQFjAEegQIBhAB&usg=AOvVaw1gQtYqVJGzcIMAbawN7Hrs&cshid=1585021315603 — Stu Sztukowski, Mar 24 '20 at 03:41
No there’s no fuzzy matching or record linkage going on. I’m simply looking to remove duplicates among rows. The issue is that I need it to consider a blank in a variable to be the same as a non-blank. Non-blanks must match exactly. — astel, Mar 24 '20 at 15:15

score 1 · Answer 1 · answered Mar 25 '20 at 20:17

Consider a table structure that has k key variables and n data variables.

The key variables define the BY group and from all the rows within the group you want to select the fewest rows whose non-missing data values completely cover the other rows with the same but fewer non-missing data values.

Consider a single row, with 5 data variables, of mixed types, abstractly named for this analysis:

C1, C2, N3, C4, C5

C for character,
N for numeric.

Note: All the data variables can not be referenced through a single DATA step array because of the different types.

Moving on:

Construct a bit mask corresponding to the non-missing values. It will have m bits turned on.
Identify sub-masks
- Those bit masks whose coverage is inferior to the bit mask
- There will 2^m-1 sub-masks

Consider an example row:

John SMITH . . USA

The sample data values have a mask with m=3

1 1 0 0 1

There are 2³-1 = 7 sub masks

1 1 0 0 *
1 * 0 0 1
1 * 0 0 *
* 1 0 0 1
* 1 0 0 *
* * 0 0 1
* * 0 0 *

For any other row in the group with the same values (or nulls) the corresponding mask will be a sub-mask of the sample row and thus 'inferior' in coverage and thus can be discarded.

A hash, with a key that is all the data variables, can be used to track principal rows with some mask, and the computed sub-masks. If a subsequent row has a sub-mask corresponding to the principal mask of a prior row, the prior row is marked as inferior and thus can be filtered from selection.

A loop over the 2ⁿ values from 0 to 2ⁿ-1 is a simple way iterate through all candidate masks. BAND operations against the principal mask will compute a sub-mask from the candidate.

Sample Code

data have;input
ID FIRST:$ LAST:$ YEAR CITY:$ COUNTRY:$; datalines;
1  John  SMITH 1985 NewYork USA
1  John  .     1985 .       USA
1  John  .     1985 .       UK
1  .     SMITH .    Miami   USA
1  John  SMITH 1985 NewYork USA
1  Mark  SMITH 1990 London  UK
1  Mark  SMITH 1990 London  UK
1  Mark  SMITH 1990 London  UK
1  Mark  SMITH 1990 London  UK
1  .     SMITH 1990 London  UK
1  Mark  .     1990 London  UK
1  Mark  SMITH .    London  UK
1  Mark  SMITH 1990 .       UK
1  Mark  SMITH 1990 London  .
1  Mark  SMITH .    London  UK
1  Mark  .     .    London  UK
;


data have;
  set have have(in=_2);
  if _2 then id=2;
run;

%macro loadkeysFor(var);

  %local i j n itop bits;

  %let n = %sysfunc(countw(&var));
  %let itop = %eval(&n-1);

  %do i = 0 %to &itop;
    %local var&i;
    %let var&i = %scan(&var,&i+1);  %* data variable names;
  %end;

  %do i = 0 %to &itop;
    _&i = &&var&i;                  %* generate code to save data values;
  %end;
  %do i = 0 %to &itop;
    _bit&i = not missing(_&i);      %* generate code to compute bits of mask;
  %end;

  %* mask indicates the non-missing permutations of 1 bits in mask submasks will be ;
  _mask = input(cats(of _bit&itop-_bit0), binary&n..);

/*  put (_&itop-_0) (=);*/
/*  put (_bit&itop-_bit0) (1.) +1 _mask binary&n..;*/

  if h.find() = 0 then continue;
    %* continue will skip this row because when 'found' the data values of the row are either
    %*  - identical to a prior row, or
    %*  - inferior to a prior row
    %*;

  _submask = _mask;
  _seq = index;                     %* non-missing seq is the mark of a principal row;

  h.add();  %* new principal row. save it, and replace all keys of corresponding key values as inferior;

  array _mark(0:%eval(2**&n)) _temporary_;

  call missing (of _mark(*));
  _seq = .;                          %* seq is the mark of an inferior submask;

  do _maskbits_ = 0 to %eval(2**&n-1);
    _submask = band (_mask, _maskbits_);  %* good ole BAND - binary and, compute the submask;
    if _submask = _mask then continue;    %* skip principal row;

    if missing(_mark(_submask)) then do;  %* reduce extra work, each submask done only once;
      _mark(_submask) = 1;

      %* generate code to assign data values (from saved values) according to submask;
      %* set host variables according to submask;
      %do i = 0 %to &itop;
        if band(_submask,blshift(1,&i)) then &&var&i = _&i; else call missing(&&var&i);
      %end;

      rc = h.replace();  %* add/replace hash entry of inferior sub-mask;
    end;
  end;

  format _submask _mask binary&n.. _seq _n_ 4.;
%mend;

options mprint;

data want;
  if 0 then set have;

  if _n_ = 1 then do;
    declare hash h (ordered:'a');
    h.defineKey ('first', 'last', 'year', 'city', 'country');
    h.defineData('first', 'last', 'year', 'city', 'country', '_seq'); * , 'row', '_submask', '_mask';
    h.defineDone();

    declare hiter hi('h');

    declare hash select();
    select.defineKey('_seq');
    select.defineDone();
  end;

  h.clear();

  do index = 1 by 1 until (last.id);
    set have;
    by id;

    row = index;

    %loadkeysFor(first last year city country)
  end;

  put index=;

  indexTop = index;

  %* retrieve maximally covering principal rows of group;
  select.clear();
  do _n_ = 1 by 1 while (hi.next() = 0);
    if _seq then OUTPUT;
  end;

  keep id first last year city country;
run;

Hey thanks for your answer! I'll give it a shot when I get a chance. Do you think it will be able to complete in a reasonable time given my data? ~20 million rows, ~2 million IDs and maybe 8 variables? — astel, Mar 26 '20 at 00:39

Removing duplicates and almost duplicates (missing values) in SQL

1 Answers1