1

How would I create an ID3 column for any overlapping ID1/ID2 column? (Not just the intersection, but for the union set.

For example, ID2 is an account number online, and ID2 is an IP address that a person used to log on. I would like to specify that as long as it was a same IP address OR the log-on ID, the rows are assigned the same ID3.

I am using SAS to code.

HAVE

Year    ID2     ID1 
2010    1       201 
2010    1       202 
2010    2       203 
2011    3       202 
2011    4       203 
2011    5       204 

WANT

Year    ID3     ID2     ID3
2010    101     1       201
2010    101     1       202
2010    102     2       203
2011    101     3       202
2011    102     4       203
2011    105     5       204
PeterClemmensen
  • 4,018
  • 3
  • 14
  • 22
Erica Jh Lee
  • 121
  • 3
  • 12
  • 1
    Where this becomes interesting is when you consider nesting: for example, what if `3 202` had another hit on `3 204`? Then you have `1 202`, `3 202` matching, meaning `1 201`, `1 202`, `3 202`, `3 204`, and `5 204` are all under the same ID3, right? – Joe Jul 29 '14 at 19:56
  • Yes, that is correct. So as long as they have any overlap, I would like to consider them to have the same ID3. Any ideas @joe? – Erica Jh Lee Jul 29 '14 at 20:02
  • 1
    Erica, I believe you have a "connected component" problem studied in graph theory. – Jeff Jul 30 '14 at 01:39

2 Answers2

2

This is a problem that's generally considered solved, but different problems have different more efficient solutions depending on the likelihood of multiple linkages.

An example (fairly clunky and not particularly efficient, but hopefully explains the general solution) follows. Basically, you need to go through the data and store linkages in a separate structure - arrays or hash tables are most common - and then at the end, output the results of the linkages. You can then merge that back to the main dataset (not provided).

data have;
input Year    ID2     ID1 ;
datalines;
2010    1       201 
2010    1       202 
2010    2       203 
2011    3       202 
2011    3       204
2011    4       203 
2011    5       204
2011    6       203
2011    7       205
;;;;
run;

data want;
set have end=eof;

array new_id[1000] _temporary_;
array new_id1_C[1000] _temporary_;
array new_id2_C[1000] _temporary_;

_i1 = whichn(id1,of new_id1_c[*]);
  _i2 = whichn(id2,of new_id2_c[*]);
  if not (_i1+_i2) then do;
    _eiter+1;
    _id3+1;
    put _i1= _i2= id1= id2= _eiter= _id3=;
    new_id[_eiter]=_id3;
    new_id1_c[_eiter] = id1;
    new_id2_c[_eiter] = id2;
  end;
  else do;
    if _i1 and _i2 then ;
    else do;
        _eiter+1;
       new_id1_c[_eiter] = id1;
       new_id2_c[_eiter] = id2;
       new_id[_eiter] = new_id[_i1+_i2]; *only one will be a value;
    end;
  end;

if eof then do;
  do _t = 1 to _eiter;
    id1 = new_id1_c[_t];
    id2 = new_id2_c[_t];
    id3 = new_id[_t];
    output;
  end;
end;
keep id1 id2 id3;
run;

In this, what I do is for each record, match it up using whichn to the array. If it is an entire-mismatch, it's a new ID; create a new ID. If it's an entire match, move on. If it's a one sided match (id1 is found but id2 is new), make a new row where id2 is added, with id1 and with the id3 that was previously assigned to id1.

The above doesn't actually work perfectly; a later cross on id1 and id2 would cause extra IDs to exist, but it's an example of the basic concept. There are papers easily available on the subject that are too long for an SO answer; for example, Transitive Record Linkage (Glenn Wright, WUSS 2010) is a good example of using hash tables to solve the issue.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • Joe, I haven't had a chance to run it but I will comment on it once I do! Thank you very much as always, – Erica Jh Lee Jul 30 '14 at 18:45
  • I would emphasize that my code above is _not_ useful in real life - it's an example of the concept. It doesn't adequately handle multiple nestings, and isn't very efficient. There are lots of good solutions on the net. – Joe Jul 30 '14 at 19:22
0

Here is an answer which I think will work for all cases, although I don't know how efficient it is:

I'm starting with a data set with a few more observations added to show how the code can handle more difficult cases.

data have;
input year id2 id1;
datalines;
2010    1       201 
2010    1       202 
2010    2       203 
2011    3       202 
2011    4       203 
2011    5       204 
2011    6       205
2011    6       203
2011    7       206
;
run;

First I create a character variable called "links" which has id1 and id2 in it.

data links;
    length links $ 20;
    set have;
    id2t = compress("id2_"||put(id2, 5.));
    id1t = compress("id1_"||put(id1, 7.));
    links = id2t||" "||id1t;
run;

Then I assign each value of "links" to an element in temporary array "agroup". Next I take each element in agroup which has been assigned a value from "links" and compare it to each subsequent value of agroup. If the values have any id1s or id2s in common, then I concatenate the two together, or, if a value doesn't match any other values, then I put it as the next value of bgroup. Then I go back to the next value of agroup and do the same, repeating the process until I have gone through all the values. At the end there is one value of bgroup for each group, and each value has the id1s and ids2 from all the members. Finally I match up these groups with the original data set to get group numbers.

data final;
    array agroup[100] $200. _temporary_;
    array bgroup[100] $200. _temporary_;
    do until (eof);
        set links end=eof;
        a+1;
        agroup[a] = links;
    end;
    do k = 1 to a;
        found = 0;
        do i = k+1 to a until (found=1);
            do j = 1 to countw(agroup[k]) until (found = 1);
                if find(agroup[i], scan(agroup[k], j)) > 0 then do;
                    found = 1;
                    agroup[i] = strip(strip(agroup[k])||' '||strip(agroup[i]));
                end;
            end;
        end;
        if found = 0 then do;
            b+1;
            bgroup[b] = agroup[k];
        end;
    end;
    do until (eof2);
        set links end=eof2;
        do i = 1 to b;
            if find(bgroup[i], id2t)+find(bgroup[i], id1t) > 0 then id3 = i;
        end;
        output;
    end;
    keep year id2 id1 id3;
run;
catquas
  • 712
  • 1
  • 5
  • 7