6

I need a function (using base SAS or RStudio) that will enable me to determine the ID numbers as of a certain date and the original (root) ID numbers as of the start date. The dataset includes the old ID, the new ID, and the date the ID changed. Example data:

OldID NewID Change Date
1 2 1/1/10
10 11 1/1/10
2 3 7/1/10
3 4 7/10/10
11 12 8/1/10

I need to know the ID numbers as of 7/15/10 and the original (root) ID (as of 1/1/10). The output should look like this:

OrigID LastID
1 4
10 11

I will then need a flag that will help me count the number of OrigID's that changed over the given time interval (in this case, 1/1/10 to 7/15/10). I need to do similar counts for multiple dates after 7/15/10 as well.

Is there a function in base SAS or RStudio that can do this?

It doesn't appear that the functions in SAS/R I researched (hierarchic loggers, synchronous tracking, sequence tracking functions) will work (e.g., logger, lumberjack, log4r, validate, futile.logger)

Reza
  • 59
  • 3
  • You can treat it as Bill of Materials type problem in SAS, and use PROC BOM if you have a license and if you don't you can use a hash lookup. https://communities.sas.com/t5/SAS-Programming/Get-ancestors-for-each-children/m-p/490584 or this one https://communities.sas.com/t5/SAS-Programming/Find-last-child/m-p/113752/highlight/true#M23446 – Reeza Jun 09 '21 at 01:44
  • This would be a recursive lookup or a tree search for shortest path algorithm type problem IMO. – Reeza Jun 09 '21 at 01:45
  • You're working with what's known as a Type 2 dimension structured table. https://en.wikipedia.org/wiki/Slowly_changing_dimension – Reeza Jun 09 '21 at 01:52
  • 1
    Or it's an edge-list with a couple of attributes tacked on: https://en.wikipedia.org/wiki/Edge_list – thelatemail Jun 09 '21 at 02:12
  • I'm very happy you asked this question. I haven't dealt much with graph data structures before and was struggling to help solve this in an easy way. You've helped get some great answers! – Stu Sztukowski Jun 09 '21 at 23:40
  • Suppose the *asof* date is 6/1/2009 ? Does ID=1 exist and ID=10 not exist at this point ? In other words, does the data include the 'birth' id record when the old id is missing? For example `OldID=.`, `NewID=1` and `ChangeDate=01-JAN-2009`. Check out https://www.lexjansen.com/nesug/nesug09/cc/CC22.pdf "Using HASH to find a sum over a transactional path" – Richard Jun 10 '21 at 11:07

6 Answers6

3

This should work, I was just too lazy to type proper dates.

Note : this assumes the data is sorted by change occurence.

data

df <- data.frame(
  OldID = c(1, 10, 2, 3, 11), NewID = c(2, 11, 3, 4, 12), ChangeDate = c(1, 1, 2, 2, 3))
df
#>   OldID NewID ChangeDate
#> 1     1     2          1
#> 2    10    11          1
#> 3     2     3          2
#> 4     3     4          2
#> 5    11    12          3

function

process <- function(df, from, to) {
  process0 <- function(df, i = 1){
    # fetch new value
    new <- df$NewID[i]
    # check in old column
    j <- match(new, df$OldID)
    
    if(is.na(j)) {
      # if not matched, set i to next row
      i <- i + 1
    } else {
      # else we update current row with new "new" value
      df$NewID[i] <- df$NewID[j]
      # and increment the changes
      df$Changes[i] <- df$Changes[i] + 1
      # and remove obsolete row
      df <- df[-j,]
    }
    # do it all over again except if there is no next row
    if(i <= nrow(df)) process0(df, i) else df
  }
  # filter data frame
  df <- subset(df, ChangeDate >= from & ChangeDate <= to, select = c("OldID", "NewID"))
  # start with 1 change per line
  df$Changes <- 1
  # run recursive function
  process0(df)
}

result

process(df, 1, 2)
#>   OldID NewID Changes
#> 1     1     4       3
#> 2    10    11       1

Created on 2021-06-09 by the reprex package (v0.3.0)

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • Where were you in 1989 when I needed you? I paid a guy thousands of dollars (well, maybe I was using our department's research funds) to solve this problem and he never delivered. `` Of course, his chosen tool was SAS, so maybe this is just a demonstration of R's superiority? I'm probably going to come back to this and put a bounty on it. I have been trying to get the igraph package to solve it, so far without success, – IRTFM Jun 09 '21 at 01:43
  • 1
    @IRTFM - I've had a bash at an *igraph* solution below. – thelatemail Jun 09 '21 at 02:23
  • Ah, that's pleasant to hear @IRTFM thanks, if you have some more thousands to spare my email is in my profile ;). Tbh I wouldn't really know how to do it with SAS, I'd probably translate the above and it wouldn't be very pretty (but Tom is a fantastic SAS guru so I'd trust his approach much more). It seems like an issue to treat with graph tools indeed but I never used those much so went at it old school. – moodymudskipper Jun 09 '21 at 11:19
3

There are many tools in SAS for finding the connected subgraphs from the graph defined by your table of [OLDID,NEWID] edges. For example PROC OPTNET from SAS/OR. Or the %SUBGRAPHS macro created by PGStats.

So let's start by converting your listing into an actual dataset.

data have ;
  input OldID NewID Date :mmddyy.;
  format date yymmdd10.;
cards;
1 2 1/1/10
10 11 1/1/10
2 3 7/1/10
3 4 7/10/10
11 12 8/1/10
;

Then call the %SUBGRAPHS() macro to get the CLUST (subgraph id) calculated for each node.

%SubGraphs(have,from=oldid,to=newid,out=clusters);

Now re-combine it with the original data so that you have the dates.

proc sql;
  create table groups as 
    select distinct a.clust,b.*
    from clusters a
    inner join have b
      on a.node = b.oldid or a.node=b.newid
    order by a.clust,b.date
  ;
quit;

Once you match the records in your data to the same subgraph id then finding the first/last node for any date range is simple:

data want ;
  do until (last.clust);
    set groups;
    by clust date;
    where '01JAN2010'd <= date <= '15JUL2010'd;
    if first.clust then origid=oldid;
  end;
  lastid=newid;
  keep origid lastid ;
run;

Of course if you actually wanted to filter the data by the dates before searching for the subgraphs you might get a larger number of subgraphs because you might have eliminated the edge that connects two groups of nodes.

Tom
  • 47,574
  • 2
  • 16
  • 29
2

An attempt with igraph, though there's probably simpler ways. I'll start with the data.

dat <- read.table(text="OldID   NewID   ChangeDate
1   2   1/1/10
10  11  1/1/10
2   3   7/1/10
3   4   7/10/10
11  12  8/1/10", header=TRUE)

dat$ChangeDate <- as.Date(dat$ChangeDate, format="%m/%d/%y")

Setup the graph with attributes

library(igraph)
g <- graph.data.frame(dat)
V(g)$ChangeDate <- dat$ChangeDate[match(V(g)$name, dat$NewID)]
V(g)$ChangeDate[is.na(V(g)$ChangeDate)] <- -Inf

Subset the graph to the date range

g <- induced_subgraph(g, which(V(g)$ChangeDate <= as.Date("2010-07-15")))

Find the starts and ends and paths

din  <- degree(g, mode="in")
st   <- names(din[din == 0])
dout <- degree(g, mode="out")
en   <- names(dout[dout== 0])

fin  <- lapply(st, all_simple_paths, graph=g, to=en)

Query the paths

t(sapply(
  unlist(fin, rec=FALSE),
  function(x) {
    nx <- as.numeric(names(x))
    c(OldID=head(nx,1), NewID=tail(nx,1), Changes=length(x)-1)
  }
))

#     OldID NewID Changes
#[1,]     1     4       3
#[2,]    10    11       1
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Bravo. Thanks for this. I had gotten only to the point where I had a graph with two paths, but finding the "ends" was not in my repertoire. It's going to be tough to decide which of these is more meritorious, Maybe I'll make an MCVE that completely captures the features of the tracking problem I failed to solve back in my academic years. – IRTFM Jun 09 '21 at 15:01
2

Another possibility using igraph and its function components. Borrowing "dat" from @thela. I use data.table for the data wrangling step, but this can of course also be done in base.

library(igraph)
library(data.table)

mem = components(graph_from_data_frame(dat))$membership
setDT(dat)
dat[.(as.integer(names(mem))), on = .(OldID), mem := mem]
dat[ChangeDate <= as.Date("2010-07-15"),
   .(OldID = OldID[1] , NewID = NewID[.N], changes = .N), by = mem]
#    mem OldID NewID changes
# 1:   1     1     4       3
# 2:   2    10    11       1
  1. Use graph_from_data_frame to create a graph, where 'OldID' and 'NewID' columns are treated as an edge list.

  2. Use components to get connected components of the graph, directly or indirectly.

  3. Select the membership element to get "the cluster id to which each vertex belongs"

  4. Join membership to original data on 'OldID'.

  5. Subset dates in i and grab relevant data grouped by membership, i.e. first 'OldID', last 'NewID' and number of rows using .N

Henrik
  • 65,555
  • 14
  • 143
  • 159
1

SAS hashes can track the ID roots and changes. The changes can be setup for linked list traversal needed to detect the asof state.

Example:

Requires a root data row in which the ID first comes into existence. The asofs are specified in a temporary array, but could also be specified in an external data set that is loaded into a third hash for asof iteration.

data have; 
attrib OldID NewID length=8 ChangeDate informat=mmddyy10. format=mmddyy10.;
input OldID NewID   ChangeDate;
datalines;
.   1   3/1/09
.  10   6/1/09
1   2   1/1/10
10  11  1/1/10
2   3   7/1/10
3   4   7/10/10
11  12  8/1/10
;

data want(keep=asof origID currID changeCount);
  attrib asof format=mmddyy10. origID currID length=8;

  declare hash roots();
  roots.defineKey('NewID');
  roots.defineData('NewID', 'ChangeDate');
  roots.defineDone();

  declare hash changes();
  changes.defineKey('OldID');
  changes.defineData('NewID', 'ChangeDate');
  changes.defineDone();

  do while (not done);
    set have end=done;
    if missing(OldID)
      then roots.add();
      else changes.add();    
  end;

  array asofs (7) _temporary_ (
    '01-JAN-2009'd
    '15-MAY-2009'd
    '15-SEP-2009'd
    '15-MAR-2010'd
    '02-JUL-2010'd
    '15-JUL-2010'd
    '15-AUG-2010'd
  );

  declare hiter hi('roots');

  do index = 1 to dim(asofs);
    asof = asofs(index);

    do while (hi.next() eq 0);

      origID = newID;
      currID = .;

      do changeCount = -1 by 1 while (ChangeDate <= asof);
        currID = NewID;
        if changes.find(key:NewID) ne 0 then leave;
      end;

      output;
    end;
  end;
  stop;
run;
Richard
  • 25,390
  • 3
  • 25
  • 38
  • I appreciate all the suggestions! Richard, I changed "do changeCount = -1 by 1" to "do changeCount = 0 by 1, which seemed to produce the total number of changes of the origID. Question: how would I modify the SAS Hash Object code if I have the ID's as of today, and want to understand what the ID was @ the same historical asof dates? For example, I changed the root data in "Have" from: . 1 3/1/09 . 10 6/1/09 to: . 4 8/1/10 . 12 8/1/10. There are no changes in "Want" for the field "changeCount" and only the currID's are displayed for 8/15/10 (12 & 4). – Reza Jun 22 '21 at 13:48
  • I reordered the dates in "Have" to descending and changed the sign of ChangeDate <= asof to ChangeDate => asof, but no results produced. I assume I need to change the field names in the declarations from NewID to OldID and vice versa. Perhaps also the do loop also needs to be altered? – Reza Jun 22 '21 at 16:02
  • Regarding *Question: how would I modify the SAS Hash Object code if I have the ID's as of today, and want to understand what the ID was @ the same historical asof dates?* That is another question that should be posted separately. – Richard Jun 23 '21 at 11:25
0

Here is my approach with purrr::reduce:

library(dplyr)

split(df, seq(nrow(df))) %>%
  purrr::reduce(~ { index <- which(.y$OldID == .x$NewID)
                    if (any(index))
                      mutate(.x, NewID = replace(NewID, index, .y$NewID))
                    else bind_rows(.x, .y) }) %>%
  transmute(OrigID = OldID, LastID = NewID)

Returns:

  OrigID LastID
1      1      4
2     10     12
ktiu
  • 2,606
  • 6
  • 20