Aggregate dataframe by row difference in R

Question

I have a dataframe that records the times at which patients died.

It looks something like this:

Time    Alive Died Lost
0       375   0    2
0.0668  373   1    9
0.3265  363   2    12
0.6439  349   0    6
0.7978  343   2    1
0.8363  340   2    2
0.8844  336   2    0
0.894   334   3    2   
0.9325  329   4    0
0.9517  325   4    1

I want to write a function that checks whether the time between two consecutive rows is less than a threshold.

If t2 - t1 < threshold, the function should record how many people died and how many were lost in that interval, add those counts to the earlier row, and drop the later row. The result should be a dataframe in which consecutive times are at least the threshold apart, with the corresponding counts added up.

For example, with a threshold of 0.29, the second row would be removed, and its 1 death and 9 losses would be added to the first row's Died/Lost columns,

giving something like

Time    Alive Died Lost
0       375   1    11
0.3265  363   2    12
0.6439  349   0    6
...

I've written something, but it fails when it has to merge multiple rows. What's the best way to do this efficiently?

EDIT

aggregateTimes <- function(data, threshold = 0.04){
  indices <- (diff(data[,1]) < threshold)
  indices <- c(FALSE, indices)
  for(i in 1:(nrow(data)-1)){
    row1 <- data[i, ]
    row2 <- data[i+1, ]
    if((row2[,1] - row1[,1]) < threshold){
      newrow <- row1 + c(0,0, row2[, 3:4])
      data[i,] <- newrow
      data <- data[-(i+1),]
    }
  }
  return(data)
}

But the indexing fails because data is of reduced dimension?
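
One plausible reading of the failure, offered as a sketch rather than a definitive diagnosis: `1:(nrow(data)-1)` is evaluated once, so after rows have been deleted `data[i+1, ]` eventually points past the last row and the comparison yields NA. The loop also moves on to the next `i` after a merge, so a row that should absorb several successors absorbs only one. A `while` loop that advances only when no merge happened avoids both problems. The name `aggregate_times` is hypothetical, and the columns are assumed to be named Time, Alive, Died and Lost as in the table above.

```r
# Data copied from the table in the question
d <- data.frame(
  Time  = c(0, 0.0668, 0.3265, 0.6439, 0.7978, 0.8363, 0.8844, 0.894, 0.9325, 0.9517),
  Alive = c(375, 373, 363, 349, 343, 340, 336, 334, 329, 325),
  Died  = c(0, 1, 2, 0, 2, 2, 2, 3, 4, 4),
  Lost  = c(2, 9, 12, 6, 1, 2, 0, 2, 0, 1)
)

# Hypothetical helper: fold each row into the previous kept row while the
# time gap to that kept row is below the threshold
aggregate_times <- function(data, threshold = 0.29) {
  i <- 1
  while (i < nrow(data)) {
    if (data$Time[i + 1] - data$Time[i] < threshold) {
      # add the next row's counts to row i, then drop that row
      data$Died[i] <- data$Died[i] + data$Died[i + 1]
      data$Lost[i] <- data$Lost[i] + data$Lost[i + 1]
      data <- data[-(i + 1), ]
      # i is not advanced: row i must be compared with its new successor
    } else {
      i <- i + 1
    }
  }
  rownames(data) <- NULL
  data
}

res <- aggregate_times(d, threshold = 0.29)
print(res)
```

With the sample data and a threshold of 0.29 this should leave four rows, at times 0, 0.3265, 0.6439 and 0.9517. Deleting rows one at a time copies the data frame on every merge, so this version is simple rather than fast.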

To answer @Moody_Mudskipper, the final expected output is:

Time    Alive Died Lost
0       375   1    11
0.3265  363   2    12
0.6439  349   13   11
0.9517  325   4    1

`0.3265 - 0.0668 < 0.29`, so as I understand your example there should only be 2 groups, since the delta is `> 0.29` only between the 3rd and 4th rows – moodymudskipper, Jan 23 '19 at 12:26
So 0.0668 - 0 < 0.29, so you would add whatever happened in that interval to the previous row and remove that row. Then continue iteratively through the whole dataset, ending with no 2 rows having a time difference of less than the threshold. Does that make sense? – prophet, Jan 23 '19 at 12:47
Then I would aggregate the first 3 rows, not the first 2, that's what I don't get – moodymudskipper, Jan 23 '19 at 12:49
Unless you compare `0.3265 - 0 > 0.29`, but that's not what your text conveys, in my opinion – moodymudskipper, Jan 23 '19 at 12:50
Could you add your final expected output? It would be much easier – moodymudskipper, Jan 23 '19 at 12:51
No, that's actually the case. Since you no longer "observe" row 2, you would only see t1 = 0 and t2 = 0.3265 – prophet, Jan 23 '19 at 12:51
Nope, still doesn't make sense to me: `0.0668 - 0 < 0.29` AND `0.3265 - 0.0668 < 0.29` AND THEN `0.6439 - 0.3265 > 0.29`, so I would aggregate the first 3 rows together and get `0 375 3 23` (as a first row) – moodymudskipper, Jan 23 '19 at 13:01
Order matters. Each row represents an observation, so you can't observe row 3 before row 2. So you would combine them in order – prophet, Jan 23 '19 at 13:04

minem · Accepted Answer · 2019-01-23T12:39:43.440

I do not know if this is exactly what you want, but this groups all the entries into time intervals of width 0.29:

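library(data.table)
setDT(d)  # d is the data frame from the question, converted by reference
d[, tt := floor(Time/0.29)]  # fixed-width bin id
# take Time and Alive from the first row of each bin:
d[, `:=`(newTime = first(Time), Alive = first(Alive)), keyby = tt]
# sum Died and Lost within each bin:
d[, lapply(.SD, sum), by = .(newTime, Alive), .SDcols = c('Died', 'Lost')]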
#    newTime Alive Died Lost
# 1:  0.0000   375    1   11
# 2:  0.3265   363    2   12
# 3:  0.6439   349    4    9
# 4:  0.8844   336   13    3
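
A short base R sketch of what the fixed-width binning does to the sample data, and why it can leave retained rows closer together than the threshold. The vector `times` is copied from the question; `bin` and `first_in_bin` are illustrative names, not part of the code above.

```r
# Times copied from the question's table (already sorted)
times <- c(0, 0.0668, 0.3265, 0.6439, 0.7978, 0.8363, 0.8844, 0.894, 0.9325, 0.9517)
threshold <- 0.29

# Fixed-width bin id, computed the same way as `tt` above
bin <- floor(times / threshold)
print(bin)
# 0 0 1 2 2 2 3 3 3 3

# First time in each bin: the rows that the grouped result reports
first_in_bin <- times[!duplicated(bin)]

# Gaps between the reported rows
gaps <- diff(first_in_bin)
print(gaps)
# 0.3265 0.3174 0.2405
```

The last gap (0.8844 - 0.6439 = 0.2405) is below 0.29, so fixed bins do not guarantee the spacing the question asks for.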

Or this, which follows the row-by-row rule more precisely:

# create newTime indicator
newTimes <- d$Time
while(any(diff(newTimes) < 0.29)){
  i <- diff(newTimes) < 0.29
  i <- which(i)[1] + 1L
  newTimes <- newTimes[-i]
}
newTimes
# [1] 0.0000 0.3265 0.6439 0.9517

d[, tt := cumsum(Time %in% newTimes)] #grouping id
# adds new columns by grouping id (tt):
d[, `:=`(newTime = first(Time), Alive = first(Alive)), keyby = tt]
# sums Died and Lost by groups:
d[, lapply(.SD, sum), by = .(newTime, Alive), .SDcols = c('Died', 'Lost')]
#    newTime Alive Died Lost
# 1:  0.0000   375    1   11
# 2:  0.3265   363    2   12
# 3:  0.6439   349   13   11
# 4:  0.9517   325    4    1
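
For readers who prefer base R, or who want the logic wrapped in a function, the same greedy rule could be sketched as below. The function name `aggregate_by_gap` and its arguments are hypothetical. The columns are assumed to be named Time, Alive, Died and Lost, with Time sorted in increasing order.

```r
# Data copied from the table in the question
d <- data.frame(
  Time  = c(0, 0.0668, 0.3265, 0.6439, 0.7978, 0.8363, 0.8844, 0.894, 0.9325, 0.9517),
  Alive = c(375, 373, 363, 349, 343, 340, 336, 334, 329, 325),
  Died  = c(0, 1, 2, 0, 2, 2, 2, 3, 4, 4),
  Lost  = c(2, 9, 12, 6, 1, 2, 0, 2, 0, 1)
)

# Hypothetical helper, base R only
aggregate_by_gap <- function(data, threshold = 0.29) {
  # One pass: keep a row only if it is at least `threshold` after the
  # most recently kept row (the first row is always kept)
  keep <- logical(nrow(data))
  last_kept <- -Inf
  for (k in seq_len(nrow(data))) {
    if (data$Time[k] - last_kept >= threshold) {
      keep[k] <- TRUE
      last_kept <- data$Time[k]
    }
  }
  grp <- cumsum(keep)  # grouping id, playing the role of `tt` above
  # Time and Alive come from the kept rows; Died and Lost are summed per group
  out <- data[keep, c("Time", "Alive")]
  out$Died <- as.vector(tapply(data$Died, grp, sum))
  out$Lost <- as.vector(tapply(data$Lost, grp, sum))
  rownames(out) <- NULL
  out
}

res <- aggregate_by_gap(d, threshold = 0.29)
print(res)
```

Because the kept rows are found in a single pass and the input is never modified, this sketch avoids both repeated row deletion and the by-reference update that `:=` performs on `d`.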

edited Jan 23 '19 at 12:39

answered Jan 23 '19 at 12:14


Why sum and not difference? – NelsonGon Jan 23 '19 at 12:16
@NelsonGon looking at the expected results, it looks like that. Only Alive should not be summed; will fix that. – minem Jan 23 '19 at 12:18
It looks different, starting from row 3 onwards. Also, perhaps .SDcols shouldn't include "Alive" then. – NelsonGon Jan 23 '19 at 12:20
It's not easily readable for me. Any chance you can explain it? – prophet Jan 23 '19 at 12:26
@prophet it's data.table syntax. The last line sums the Died and Lost columns over rows grouped by newTime and Alive. The second-to-last line adds new columns: `first` takes the first value in each group, and `keyby` is almost like `by`, a grouping. – NelsonGon Jan 23 '19 at 12:34
@prophet for intro to `data.table` I suggest: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html – minem Jan 23 '19 at 12:40
Thank you for the explanation, I'm having a look at it now! – prophet Jan 23 '19 at 12:58
Can you implement the data.table functionality within a function? It seems that I need eval(parse( to use ":=". What about a non-data.table approach? – prophet Jan 23 '19 at 13:50
@prophet sorry, but I do not understand your problem. The line with `:=` can be replaced with two lines: `d[, newTime := first(Time), keyby = tt]` & `d[, Alive := first(Alive), keyby = tt]`. Maybe this helps... – minem Jan 23 '19 at 13:55
