
I'm trying to write a window function with dplyr that returns a new vector holding the difference between each value and the first value of its group. For example, given this dataset:

dummy <- data.frame(userId=rep(1,6),
     libId=rep(999,6),
     curatorId=c(1:2,1:2,1:2),
     iterationNum=c(0,0,1,1,2,2),
     rf=c(5,10,0,15,30,40)
)

That creates this dataset:

  userId libId curatorId iterationNum rf
1      1   999         1            0  5
2      1   999         2            0 10
3      1   999         1            1  0
4      1   999         2            1 15
5      1   999         1            2 30
6      1   999         2            2 40

And given this grouping:

library(dplyr)
dummy <- group_by(dummy, libId, userId, curatorId)

I would like to get this result:

  userId libId curatorId iterationNum rf rf.diff
1      1   999         1            0  5       0
2      1   999         2            0 10       0
3      1   999         1            1  0      -5
4      1   999         2            1 15       5
5      1   999         1            2 30      25
6      1   999         2            2 40      30

So for each group of user, lib and curator, I would get the rf value minus that group's rf value at iterationNum == 0. I tried the `first` function, the `rank` function and others, but couldn't find a way to nail it.

---EDIT---

This is what I tried:

dummy %>% 
  group_by(userId,libId,curatorId) %>% 
  mutate(rf.diff = rf - subset(dummy,iterationNum==0)[['rf']])

And:

dummy %>% 
  group_by(userId,libId,curatorId) %>% 
  mutate(rf.diff = rf - first(x = rf,order_by=iterationNum))

Which crashes R and returns this error message:

pure virtual method called
terminate called after throwing an instance of 'Rcpp::exception'
  what(): incompatible size (%d), expecting %d (the group size) or 1

Omri374
  • It seems that you already know all the functions you need to do this. Can you show what you tried and what did not work as expected? Perhaps you just need to arrange (order) your data before computing the differences. – talat Jan 18 '15 at 21:51
  • You were close. Use `rf - rf[iterationNum == 0]` inside the mutate instead. The other option is to arrange the data using `arrange(iterationNum)` as a separate step in the pipe and then use `rf - first(rf)` in the mutate, if you are sure that each group has a 0 in iterationNum and no lower values. – talat Jan 18 '15 at 22:07
  • `rf - first(rf, iterationNum)` – hadley Jan 19 '15 at 04:54
  • Thanks @docendodiscimus! that worked! How do I make sure the order is correct with this syntax? – Omri374 Jan 19 '15 at 06:56
  • @hadley, I got an error. First it said "Error: all arguments of 'first' after the first one should be named". Then, when I wrote `mutate(rf.diff=rf-first(rf,order_by=iterationNum))`, my R session crashed with this message: `pure virtual method called` – Omri374 Jan 19 '15 at 06:58
  • @Omri374 it worked for me (after naming the argument). Maybe you need dplyr 0.4? – hadley Jan 19 '15 at 14:12
  • @hadley I'm using R 3.1.2 64-bit on Windows and dplyr 0.4.1. Is there a difference between dplyr 0.4 and dplyr 0.4.1 causing this issue? – Omri374 Jan 26 '15 at 15:23

1 Answer

The two approaches I suggested in the comments above are as follows.

dummy %>%
  group_by(libId, userId, curatorId) %>%
  mutate(rf.diff = rf - rf[iterationNum == 0])
#Source: local data frame [6 x 6]
#Groups: libId, userId, curatorId
#
#  userId libId curatorId iterationNum rf rf.diff
#1      1   999         1            0  5       0
#2      1   999         2            0 10       0
#3      1   999         1            1  0      -5
#4      1   999         2            1 15       5
#5      1   999         1            2 30      25
#6      1   999         2            2 40      30
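
One caveat with the subsetting form: `rf[iterationNum == 0]` has to yield exactly one value per group. A group with no `iterationNum == 0` row yields a zero-length vector, and a group with two such rows yields two values, and either case breaks the size rule that `mutate` enforces. The sketch below is one possible guard, not part of the original suggestion: it uses base R's `match()` to pick the first baseline position, or `NA` when there is none. The `no_base` data frame is hypothetical and exists only to show a group without a baseline.

```r
library(dplyr)

# Hypothetical data: curator 2 has no row with iterationNum == 0
no_base <- data.frame(
  curatorId    = c(1, 1, 2, 2),
  iterationNum = c(0, 1, 1, 2),
  rf           = c(5, 0, 15, 40)
)

guarded <- no_base %>%
  group_by(curatorId) %>%
  # match() gives the position of the first 0, or NA if the group has none,
  # so rf[...] is always a single value (possibly NA) within each group
  mutate(rf.diff = rf - rf[match(0, iterationNum)]) %>%
  ungroup()

print(guarded)
```

Groups without a baseline end up with `NA` in `rf.diff` instead of raising an error, which may or may not be what you want for your data.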

Or using arrange to order the data by iterationNum:

dummy %>%
  arrange(iterationNum) %>%
  group_by(libId, userId, curatorId) %>%
  mutate(rf.diff = rf - first(rf))
#Source: local data frame [6 x 6]
#Groups: libId, userId, curatorId
#
#  userId libId curatorId iterationNum rf rf.diff
#1      1   999         1            0  5       0
#2      1   999         2            0 10       0
#3      1   999         1            1  0      -5
#4      1   999         2            1 15       5
#5      1   999         1            2 30      25
#6      1   999         2            2 40      30

As you can see, both produce the same output for the sample data.
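
For completeness, the variant hadley mentions in the comments, `first()` with a named `order_by` argument, avoids the separate `arrange()` step. The asker reported that it crashed R under dplyr 0.4.1 on Windows, so treat the following as a sketch that assumes a more recent dplyr release. The `shuffled` object is a hypothetical reordering of the sample data, there only to show that the result does not depend on row order.

```r
library(dplyr)

dummy <- data.frame(
  userId       = rep(1, 6),
  libId        = rep(999, 6),
  curatorId    = c(1:2, 1:2, 1:2),
  iterationNum = c(0, 0, 1, 1, 2, 2),
  rf           = c(5, 10, 0, 15, 30, 40)
)

# Hypothetical fixed shuffle, so the rows are no longer sorted by iterationNum
shuffled <- dummy[c(6, 3, 1, 4, 2, 5), ]

by_order <- shuffled %>%
  group_by(libId, userId, curatorId) %>%
  # first() picks the rf value belonging to the smallest iterationNum in each group
  mutate(rf.diff = rf - first(rf, order_by = iterationNum)) %>%
  ungroup() %>%
  # sort only to make the result comparable with the tables above
  arrange(iterationNum, curatorId)

print(by_order)
```

Like the `arrange()` version, this takes the row with the lowest `iterationNum` as the baseline, so it matches the `rf[iterationNum == 0]` form only when 0 is the smallest value in every group.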

talat