3

I have data that is grouped using dplyr in R. I would like to find the last occurrence of observations ('B') equal to or greater than 1 (1, 2, 3 or 4) in each group ('A'), in terms of the 'day' they occurred. I would like the value of 'day' for each group to be given in a new column.

For example, given the following sample of data, grouped by A (this has been simplified, my data is actually grouped by 3 variables):

A  B  day
a  2  1
a  2  2
a  1  5
a  0  8
b  3  1
b  3  4
b  3  6 
b  0  7 
b  0  9
c  1  2 
c  1  3
c  1  4

I would like to achieve the following:

A  B  day last
a  2  1   5
a  2  2   5
a  1  5   5
a  0  8   5
b  3  1   6
b  3  4   6
b  3  6   6
b  0  7   6
b  0  9   6
c  1  2   4
c  1  3   4
c  1  4   4

I hope this makes sense, thank you all very much for your help! I have thoroughly searched for my answer online but couldn't find anything. However, if I have accidentally duplicated a question then I apologise.

epo3
  • 2,991
  • 2
  • 33
  • 60
Keeley Seymour
  • 281
  • 3
  • 10

2 Answers2

3

We can try

library(data.table)
setDT(df1)[, last := day[tail(which(B>=1),1)] , A]
df1
#    A B day last
# 1: a 2   1    5
# 2: a 2   2    5
# 3: a 1   5    5
# 4: a 0   8    5
# 5: b 3   1    6
# 6: b 3   4    6
# 7: b 3   6    6
# 8: b 0   7    6
# 9: b 0   9    6
#10: c 1   2    4
#11: c 1   3    4
#12: c 1   4    4

Or using dplyr

library(dplyr)
df1 %>%
   group_by(A) %>%
   mutate(last = day[max(which(B>=1))])

Or use the last function from dplyr (as @docendo discimus suggested)

df1 %>%
   group_by(A) %>%
   mutate(last= last(day[B>=1]))

For the second question,

setDT(df1)[, dayafter:= if(all(!!B)) NA_integer_  else 
             day[max(which(B!=0))+1L] , A]
#    A B day dayafter
# 1: a 2   1        8
# 2: a 2   2        8
# 3: a 1   5        8
# 4: a 0   8        8
# 5: b 3   1        7
# 6: b 3   4        7
# 7: b 3   6        7
# 8: b 0   7        7
# 9: b 0   9        7
#10: c 1   2       NA
#11: c 1   3       NA
#12: c 1   4       NA
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 2
    Or `df %>% group_by(A) %>% mutate(last = last(day[B >= 1]))` – talat Feb 26 '16 at 10:36
  • Thanks so much both of you, I ended up using the last function from dplyr. Any idea how I can produce a column containing the day AFTER the day on which the observation of 1/2/3/4 occurred? I don't want the first day a 0 occurred even though that seems logical (my actual data is more complicated), I'd like the day after the last 1, 2, 3 or 4 value. So for 'a' in my example it would be 8, 'b' would be 7 and 'c' would be NA. – Keeley Seymour Feb 26 '16 at 14:20
  • 1
    @KeeleySeymour I updated with the post. Please check if that helps – akrun Feb 26 '16 at 18:34
  • Is there a reason for using different ways to extract the value? `tail(which(B>=1),1)` in data.table vs `max(which(B>=1))`? There's a `last()` function in `data.table` as well.. which is also compatible with other packages' `last()` function. – Arun Feb 26 '16 at 20:27
  • @Arun The `last` was contributed by docendodiscimus. I don't know whether `tail` or just `max` will have any difference in efficiency (haven't checked) – akrun Feb 26 '16 at 20:29
  • I did not talk about efficiency, but I'm quite sure `tail()` being an S3 generic will be much less efficient. My Q is purely based on curiosity.. as to why?!? – Arun Feb 26 '16 at 20:30
  • @Arun It is just that when I used `dplyr`, I thought about `max`. I was not thinking about efficiency at that time. – akrun Feb 26 '16 at 20:32
  • Alright, I guess ;-). Given the number of Q of very similar genre you answer every day, it sparked my curiosity. – Arun Feb 26 '16 at 20:37
2

Here is a solution that does not require loading external packages:

df <- structure(list(A = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), 
B = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L), day = c(1L, 
2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L)), .Names = c("A", 
"B", "day"), class = "data.frame", row.names = c(NA, -12L))

x <- split(df, df$A, drop = TRUE)

tp <- lapply(x, function(k) {
  tmp <- k[k$B >0,]
  k$last <- tmp$day[length(tmp$day)]
  k
})

do.call(rbind, tp)

         A B day last
#a.1  a 2   1    5
#a.2  a 2   2    5
#a.3  a 1   5    5
#a.4  a 0   8    5
#b.5  b 3   1    6
#b.6  b 3   4    6
#b.7  b 3   6    6
#b.8  b 0   7    6
#b.9  b 0   9    6
#c.10 c 1   2    4
#c.11 c 1   3    4
#c.12 c 1   4    4
Mikko
  • 7,530
  • 8
  • 55
  • 92
  • According to the Q, they want "equal to or greater than 1". I think using `which.max` is a little unsafe here since it returns the (first) maximum but not necessarily the last value (or we need to make an assumption about the underlying data) – talat Feb 26 '16 at 11:14
  • Thank you for your comment. I edited the answer accordingly. – Mikko Feb 26 '16 at 11:56