11

I have the following data set

PatientName BVAID   Rank    TreatmentCode   TreatmentID DoseID  
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TBC            1             1  
Tim Stuart  BVA-021 7   OP_TBC            1             1  
Tim Stuart  BVA-048 10  OP_TBC            1             1  
Tim Stuart  BVA-020 14  OP_TBC            1             1  
Tim Stuart  BVA-024 15  OP_TBC            1             1  
Tim Stuart  BVA-001 16  OP_TBC            1             1  
Tim Stuart  BVA-013 27  OP_TBC            1             1  
Tim Stuart  BVA-018 28  OP_TBC            1             1  
Tim Stuart  BVA-051 29  OP_TBC            1             1  
Tim Stuart  BVA-027 3   OP_TC             2             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-048 10  OP_TC             2             1  
Tim Stuart  BVA-020 14  OP_TC             2             1    
Tim Stuart  BVA-001 16  OP_TC             2             1  
Tim Stuart  BVA-002 17  OP_TC             2             1    
Tim Stuart  BVA-019 18  OP_TC             2             1  
Tim Stuart  BVA-044 22  OP_TC             2             1  
Tim Stuart  BVA-025 23  OP_TC             2             1  
Tim Stuart  BVA-016 26  OP_TC             2             1  
Tim Stuart  BVA-013 27  OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  
Tim Stuart  BVA-002 17  OP_SICO           3             1  
Tim Stuart  BVA-013 27  OP_SICO           3             1  

I need to output the records with the smallest rank in each TreatmentID group however if the record was outputted in the previous TreatmentID group I need to select the next smallest rank and output the record for the TreamtmentID group - I only need one record per TreatmentID group. This needs to be a scalable solution that I can automate. My output file will only have tree unique records i.e. one per each group and each records is unique in BVAID and will have the smallest rank in that group.

PatientName BVAID   Rank    TreatmentCode   TreatmentID DoseID  
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  

which program can handle this better SAS or R

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Makoto
  • 111
  • 5
  • When you say scalable, how large a dataset do you believe your solution would need to consider? – Hong Ooi Jan 12 '12 at 03:02
  • 1
    What does "if the record was outputted in the previous TreatmentID" mean? No duplicate BVAIDs? or No duplicate TreatmentCodes? – Jay Corbett Jan 12 '12 at 15:15

5 Answers5

13

Compact, scalable and readable R solution :

require(data.table)
DT = as.data.table(dat)   # dat input from Brian's answer
r = 0
DT[,{r<<-min(Rank[Rank>r]); .SD[Rank==r]}, by=TreatmentID]

     TreatmentID PatientName   BVAID Rank TreatmentCode DoseID
[1,]           1  Tim Stuart BVA-027    3        OP_TBC      1
[2,]           2  Tim Stuart BVA-041    4         OP_TC      1
[3,]           3  Tim Stuart BVA-001   16       OP_SICO      1
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Ding ding ding! I think we have a winner! – Josh O'Brien Jan 12 '12 at 17:46
  • Now, can you extend it so that: (a) it will work even if Treatment ID is not always in ascending order within Patient (e.g. by adding something like `key=c("PatientName", "TreatmentID"` in the call to `as.data.table`); and (b) it will work for an actual data set with more than one Patient? That second piece seems a bit more complicated -- I've got some ideas, and will try them out in a bit, if you haven't already posted something more elegant by then ;) – Josh O'Brien Jan 12 '12 at 17:55
  • @JoshO'Brien `by=` already does retain the order of the groups as they first appear in `DT` so it already does (a) and (b). I linked here from an issue I just raised [#1880](https://github.com/Rdatatable/data.table/issues/1880) to discuss a change to make that clearer. Interested in your view. – Matt Dowle Oct 15 '16 at 02:32
  • Hi Matt. Thanks for pointing me to that cool proposal and interesting discussion of how `by=` et al. work. First, a caveat: I rarely work with data for which the speedup would be a significant win. That said, it should definitely be made available to those who do need it (and I would likely use that option, since it seems cleaner to me to work with data that appear in properly sorted order). – Josh O'Brien Oct 17 '16 at 15:50
  • I might be inclined to permanently leave `by` with its current behavior, while adding a switch argument (`keeporder=` or something like that, with default value `TRUE`). Those few users who need the speedup could then set `keeporder=FALSE` to get the more streamlined behavior of `by`. – Josh O'Brien Oct 17 '16 at 15:51
  • I tend to think most users will find the reordering produced by your proposed solution jarring, in part because, if they're anything like me, they glance at the first few rows of original and resultant `data.table`s as a basic sanity check, for reassurance that their `[.data.table` call did what they wanted it to. That's maybe not optimal or correct user behavior, but I think it is how us humans actually behave. Facilitating that sort of checking, by default, might also be seen as in keeping with R's status a language well suited for exploratory analyses. – Josh O'Brien Oct 17 '16 at 15:57
  • Those are my thoughts, but I tend to defer to you as the much better and more experienced developer. – Josh O'Brien Oct 17 '16 at 15:58
  • Many thanks @JoshO'Brien. Will leave it open for a long time. Main thing now is to get `keyby=` working at proper speed and we'll see how things settle after that then. – Matt Dowle Oct 17 '16 at 20:20
5

Here's an R solution. I'd be really interested to know if there's a method that's much more compact than this.

library(plyr)

df <- df[order(df$PatientName, df$TreatmentID),]

ddply(df, .(PatientName), function(DF) {
    # For each Treatment, find the value of Rank to be kept    
    splitRanks <- split(DF$Rank, DF$TreatmentID)
    minRanks <- Reduce(f = function(X, Y) min(Y[Y>min(X)]), 
                       x = splitRanks[-1], 
                       init = min(splitRanks[[1]]), accumulate = TRUE)
    # For each Treatment, extract row w/ Rank determined by the calculation above
    splitDF <- split(DF, DF$TreatmentID)
    rows <- mapply(FUN = function(X, Y) X[X$Rank==Y,], splitDF, minRanks, 
                   SIMPLIFY = FALSE)
    # Bind the extracted rows back together in a data frame
    do.call("rbind", rows)
})

#   PatientName   BVAID Rank TreatmentCode TreatmentID DoseID
# 1  Tim Stuart BVA-027    3        OP_TBC           1      1
# 2  Tim Stuart BVA-041    4         OP_TC           2      1
# 3  Tim Stuart BVA-001   16       OP_SICO           3      1
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
5

My SAS solution. All steps are scalable:

data test;
  input  PatientName $ 1-10
         BVAID $ 
         Rank    
         TreatmentCode $  
         TreatmentID 
         DoseID 
        ;
datalines;
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TBC            1             1  
Tim Stuart  BVA-021 7   OP_TBC            1             1  
Tim Stuart  BVA-048 10  OP_TBC            1             1  
Tim Stuart  BVA-020 14  OP_TBC            1             1  
Tim Stuart  BVA-024 15  OP_TBC            1             1  
Tim Stuart  BVA-001 16  OP_TBC            1             1  
Tim Stuart  BVA-013 27  OP_TBC            1             1  
Tim Stuart  BVA-018 28  OP_TBC            1             1  
Tim Stuart  BVA-051 29  OP_TBC            1             1  
Tim Stuart  BVA-027 3   OP_TC             2             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-048 10  OP_TC             2             1  
Tim Stuart  BVA-020 14  OP_TC             2             1    
Tim Stuart  BVA-001 16  OP_TC             2             1  
Tim Stuart  BVA-002 17  OP_TC             2             1    
Tim Stuart  BVA-019 18  OP_TC             2             1  
Tim Stuart  BVA-044 22  OP_TC             2             1  
Tim Stuart  BVA-025 23  OP_TC             2             1  
Tim Stuart  BVA-016 26  OP_TC             2             1  
Tim Stuart  BVA-013 27  OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  
Tim Stuart  BVA-002 17  OP_SICO           3             1  
Tim Stuart  BVA-013 27  OP_SICO           3             1  
;
run;

proc sort data=test;
  by treatmentid;
run;

data test2;
  set test;
  by treatmentid;
  retain smallest;

  **
  ** CREATE AN EMPTY HASH TABLE THAT WE CAN STORE A LIST OF 
  ** RANKS IN THAT HAVE ALREADY BEEN USED. DONE THIS WAY FOR
  ** SCALABILITY.
  *;
  if _n_ eq 1 then do;
    declare hash ht();
    ht.definekey ('rank');
    ht.definedone();
  end;

  if first.treatmentid then do;
    smallest = .;
  end;

  **
  ** IF THE CURRENT RANK HAS NOT ALREADY BEEN USED THEN 
  ** EVALUATE IT TO SEE IF ITS THE SMALLEST VALUE.
  *;
  if ht.find() ne 0 then do;
    smallest = min(smallest,rank);
  end;

  **
  ** SAVE THE SMALLEST UNUSED RANK BACK TO THE RANK VALUE.
  ** THEN ADD IT TO THE HASH TABLE AND FINALLY OUTPUT THE
  ** OBSERVATION.
  *;
  if last.treatmentid then do;
    rank = smallest;
    ht.add();    
    output;
  end;

  drop smallest;
run;

Does SAS win? jk! ;-)

Robert Penridge
  • 8,424
  • 2
  • 34
  • 55
  • Yes hash is a very elegant solution. Thanks Rob I'll look up hash objects didn't know they can be so flexibile – Makoto Jan 12 '12 at 10:50
  • 1
    Appreciate the example of a Hash object...but did you account for "if the record was outputted in the previous TreatmentID group I need to select the next smallest rank". Not sure if BVAID or TreatmentCode var should not be duplicated...example does not duplicate BVAID var – Jay Corbett Jan 12 '12 at 15:13
  • Good questions - I'm not sure either! If the other vars need to be retained than it shouldn't be difficult to modify the above. – Robert Penridge Jan 12 '12 at 15:45
3

Here is another R solution. What makes this a harder problem than most is that it can not be treated as a split-apply-combine problem since the row to be selected depends not only on all rows with a given TreatmentID, but also the result of what was determined by the previous (assuming that means the next smallest) TreatmentID.

First, the data in a pasteable form (in case anyone else wants to take a crack at it):

dat <-
structure(list(PatientName = c("Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart"), BVAID = c("BVA-027", "BVA-041", "BVA-021", "BVA-048", 
"BVA-020", "BVA-024", "BVA-001", "BVA-013", "BVA-018", "BVA-051", 
"BVA-027", "BVA-041", "BVA-048", "BVA-020", "BVA-001", "BVA-002", 
"BVA-019", "BVA-044", "BVA-025", "BVA-016", "BVA-013", "BVA-001", 
"BVA-002", "BVA-013"), Rank = c(3L, 4L, 7L, 10L, 14L, 15L, 16L, 
27L, 28L, 29L, 3L, 4L, 10L, 14L, 16L, 17L, 18L, 22L, 23L, 26L, 
27L, 16L, 17L, 27L), TreatmentCode = c("OP_TBC", "OP_TBC", "OP_TBC", 
"OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", 
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", 
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_SICO", "OP_SICO", "OP_SICO"
), TreatmentID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), DoseID = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("PatientName", "BVAID", 
"Rank", "TreatmentCode", "TreatmentID", "DoseID"), class = "data.frame", 
row.names = c(NA, -24L))

Now my solution

matches <- dat[0,]
TreatmentIDs <- sort(unique(dat$TreatmentID))
for (TreatmentIDidx in seq_along(TreatmentIDs)) {
    TreatmentID <- TreatmentIDs[TreatmentIDidx]
    treat.flg <- dat$TreatmentID == TreatmentID
    match <- dat[treat.flg &
                 dat$Rank == min(setdiff(dat$Rank[treat.flg],
                                         matches$Rank[matches$TreatmentID==
                                                      TreatmentIDs[TreatmentIDidx-1]])),]
    matches <- rbind(matches, match)
}

which gives the desired result:

> matches
   PatientName   BVAID Rank TreatmentCode TreatmentID DoseID
1   Tim Stuart BVA-027    3        OP_TBC           1      1
12  Tim Stuart BVA-041    4         OP_TC           2      1
22  Tim Stuart BVA-001   16       OP_SICO           3      1

My SAS is rusty, and I don't have a copy to try things with right now, so I'll leave it to someone else to make a SAS solution to compare with.

Brian Diggs
  • 57,757
  • 13
  • 166
  • 188
  • Your description of why this is a harder-than-usual problem is spot on. Thanks too for including the cut-and-paste ready data in your answer! – Josh O'Brien Jan 11 '12 at 23:33
  • Thanks Brian - nice solution in R I'll experiment with different size data sets I wander how many records can R handle – Makoto Jan 12 '12 at 10:52
2

my sas solution.

assume that you have data set (test) and sort it like what you have done here (by patientname, treatmentid then rank). This code fits in multiple patientnames situation and assume that these steps are performed for every one of patientname (delete all patientname related if you don't want this level )

%macro m1();
%begin: proc append base=new data=test(firstobs=1 obs=1);

data _null_;
set test(firstobs=1 obs=1);
call symput('r', rank);
call symput('id',Treatmentid);
call symput('name',patientname);

data test;
set test;
if (rank=&r or Treatmentid=&id) and patientname=symget('name') then delete;

%let dsid=%sysfunc(open(test));
%let nobs=%sysfunc(attrn(&dsid,nobs)); 
%let rc=%sysfunc(close(&dsid));

%if &nobs^=0 %then %goto begin;
%mend;

%m1(); run;
Robbie Liu
  • 1,511
  • 1
  • 11
  • 16
  • Nice Macro Robbie it's fast I'm going to test which one is faster your macro or hash object. Thanks – Makoto Jan 12 '12 at 10:55
  • @Makoto You didn't give more detail about your data set, like whether you need to deal with multiple patientnames. Depending on the actual feature of data set, code could be more efficient. – Robbie Liu Jan 12 '12 at 12:15