I am nearly new to R, so sorry if I make some basic questions, but I can not find a solution to this "simple" problem: Having a database (big one, 25 million rows, 14 cols) of patients, I have several rows for each "id", with for example this structure:
"id" "birth_date" "treatment" "date_treatment"
123 2002-01-01 2 2011-01-03
123 2002-01-01 3 2011-10-03
124 2002-01-01 6 2009-11-07
124 2002-01-01 NA NA
... ..... ...... ........
1022 2007-01-01 4 2011-01-06
I have to use ff package to be able to work with little amount of RAM, so ALL the processes should be into ff functions. And I want to know, for each single "id", which is the minimum "age" when he/she received a treatment = 2 or 4. so, that would be, in each single id, in generic code :
if(treatment in c(2,4)) then min(date_treatment - birth_date)
I only want to keep those minimum "ages" data and perhaps the ids.
One solution is to do:
age_c <- (data$date_treatment - data$birth_date)/365.25;
data$age_c <- age_c;
idx <- ffwhich( data, treatment %in% c(2,4) );
result <- data[idx,];
This keeps all the process into ff, and no memory problems, but... I still need to find a way to take those minimums ages for each id... ffdfdply seems to be able to do that:
age_fun <- function(x){
min_ <- min.ff(x$age_c);
data.frame( age = min_);
}
result2 <- ffdfdply(x = data,
split = data$id,
FUN = function(x) age_fun(x),
BATCHBYTES = 5000,
trace=TRUE
);
Which takes looooong time and also gives me a lot of different errors....
Any solution to that?
It is a general problem that in SAS or SQL are easy to do, but i do not find the right combination in R.
So the general question would be:
how to compute row-column functions for identical values (groups) of a variable (row) in very big data sets ???
Thanks !!