
I have the below df:

df <- data.table(user = c('a', 'a', 'a', 'b', 'b')
                 , spend = 1:5
                 , shift_by = c(1,1,2,1,1)
                 ); df

   user spend shift_by
1:    a     1        1
2:    a     2        1
3:    a     3        2
4:    b     4        1
5:    b     5        1

I am looking to create a lag column, only this time the n parameter in data.table's shift function is dynamic and takes df$shift_by as input. My expected result is:

df[, spend_shifted := c(NA, 1, 1, NA, 4)]; df

   user spend shift_by spend_shifted
1:    a     1        1            NA
2:    a     2        1             1
3:    a     3        2             1
4:    b     4        1            NA
5:    b     5        1             4

However, with the below attempt it gives:

df[, spend_shifted := shift(x=spend, n=shift_by, type="lag"), user]; df

   user spend shift_by spend_shifted
1:    a     1        1            NA
2:    a     2        1            NA
3:    a     3        2            NA
4:    b     4        1            NA
5:    b     5        1            NA

This is the closest example I could find. However, I need a group by, and I am after a data.table solution because of speed. Truly look forward to any ideas.

  • adaptive shift, sounds like FR. We already have adaptive rolling statistics in DT. – jangorecki Nov 03 '21 at 21:16
  • Hi, @jangorecki. Am I correct in thinking you are one of the contributors to `data.table`? Love your work! Did you mean there is a function for our question? Can I ask you to post an answer, please? Am eager to find out. – Sweepy Dodo Nov 04 '21 at 00:14
  • FR means feature request. We have similar functionality but not for the shift function. – jangorecki Nov 04 '21 at 08:43
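For context, the adaptive rolling statistics mentioned here are the `froll*` functions, whose `adaptive = TRUE` mode takes a per-row vector of window widths. A minimal sketch of that existing rolling API (there is no adaptive `shift` equivalent as of this exchange):

```r
library(data.table)

# Adaptive rolling mean: n gives each row its own window width,
# with every window right-aligned at that row.
x <- c(1, 2, 3, 4, 5)
n <- c(1, 2, 2, 3, 1)
res <- frollmean(x, n, adaptive = TRUE)
res
# [1] 1.0 1.5 2.5 3.0 5.0
```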

5 Answers


I believe this will work. You can drop the `newindex` column afterwards.

df[, newindex := rowid(user) - shift_by]
df[newindex < 0, newindex := 0]
df[newindex > 0, spend_shifted := df[, spend[newindex], by = .(user)]$V1]
#    user spend shift_by newindex spend_shifted
# 1:    a     1        1        0            NA
# 2:    a     2        1        1             1
# 3:    a     3        2        1             1
# 4:    b     4        1        0            NA
# 5:    b     5        1        1             4
  • Thank you, @Wimpel However, if the column `shift_by` in `df` were to change from the current `shift_by = c(1,1,2,1,1)` to `shift_by = c(2,1,2,1,1)` we would get the error `Error in spend[newindex] : only 0's may be mixed with negative subscripts`. It would be unfair to say your solution does not work as this constitutes a change to the question. However, can this answer be tweaked to accommodate this? I can't figure out what `df[, spend[newindex], by = .(user)]$V1` does. Could you explain please? – Sweepy Dodo Nov 02 '21 at 20:02
  • a bit dirty but setting negative `new_index` to 0 seems to do the trick ;-) – Wimpel Nov 02 '21 at 20:29
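For anyone else puzzling over that inner expression, a minimal walk-through of what `df[, spend[newindex], by = .(user)]$V1` returns on the example data:

```r
library(data.table)

df <- data.table(user = c('a', 'a', 'a', 'b', 'b'),
                 spend = 1:5,
                 shift_by = c(1, 1, 2, 1, 1))
df[, newindex := rowid(user) - shift_by]  # within-group row number minus the lag
df[newindex < 0, newindex := 0]

# Per group, spend[newindex] picks the spend value sitting shift_by rows
# earlier; an index of 0 selects nothing, so those rows drop out entirely.
res <- df[, spend[newindex], by = .(user)]$V1
res
# [1] 1 1 4  -- one value per row whose newindex > 0, which is why the
#             assignment back is filtered with newindex > 0
```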

Using matrix subsetting of data.frames:

df[, 
   spend_shifted := 
     data.frame(shift(spend, n = unique(sort(shift_by))))[cbind(1:.N, shift_by)], 
   by = user]

Another solution (in addition to Wimpel's) without shift:

df[, {rows <- 1:nrow(.SD) - shift_by; .SD[replace(rows, rows <= 0, NA), spend]}, 
   by = user]
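The `cbind()` in the first block is R's matrix subsetting: indexing with a two-column matrix returns, for each row of that matrix, the element at (row, column), which lets every row select its own lag column. A standalone illustration outside any grouping:

```r
# Columns hold lag-1 and lag-2 of 1:5; the index matrix lets row i pick
# the column matching its own shift_by value.
m <- data.frame(lag1 = c(NA, 1, 2, 3, 4),
                lag2 = c(NA, NA, 1, 2, 3))
res <- m[cbind(1:5, c(1, 1, 2, 1, 1))]
res
# [1] NA  1  1  3  4
```

Note that using `shift_by` directly as the column index appears to assume the lag values run 1, 2, … without gaps; `unique(sort(shift_by))` orders the shifted columns but does not remap the values onto column positions.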
  • Honestly, I looked at your first code block and missed the second, I need to read more completely. I'll delete mine, yours was there first and clearly showed the same method/intent. Feel free to adopt the `seq_len` if you'd like. – r2evans Nov 03 '21 at 09:34
  • @r2evans I was not complaining at all. I like to see how the solutions improve upon each other and learn. – s_baldur Nov 03 '21 at 10:39
  • None taken! I don't like posting nearly-identical answers, preferring to comment (where appropriate) instead. (And frankly, `nrow(.)==0` is unlikely, perhaps even not possible with `data.table`'s grouping.) – r2evans Nov 03 '21 at 11:41
  • @sindri_baldur Thank u! I have read through all methods. r2evans, ThomasIsCoding, Wimpel and yours all use the same logic. However, may I ask: within `.SD[replace(rows, rows <= 0, NA), spend]`, is `.SD` referring to the column `spend`? – Sweepy Dodo Nov 07 '21 at 02:10
  • In `data.table`, `.SD` (if `.SDcols` is not specified) refers to all the columns except those in `by`. – s_baldur Nov 07 '21 at 17:28

Here's another approach, using a data.table join. I use two helper-columns to join on:

df[, row := .I, by = .(user)]
df[, match_row := row - shift_by]
df[df, on = .(user, match_row = row), x := i.spend]
df[, c('row', 'match_row') := NULL]


#    user spend shift_by spend_shifted  x
# 1:    a     1        1            NA NA
# 2:    a     2        1             1  1
# 3:    a     3        2             1  1
# 4:    b     4        1            NA NA
# 5:    b     5        1             4  4
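The same join can also be written end-to-end against a fresh copy of the data, assigning straight into `spend_shifted` instead of a separate `x` column (a sketch using the same helper columns; `.I` is already unique across the table, so grouping the first step by `user` is not strictly needed):

```r
library(data.table)

df <- data.table(user = c('a', 'a', 'a', 'b', 'b'),
                 spend = 1:5,
                 shift_by = c(1, 1, 2, 1, 1))
df[, row := .I]                     # global row number
df[, match_row := row - shift_by]   # the row we want to borrow spend from

# Update join: each row looks up the row sitting shift_by positions earlier
# for the same user and copies its spend across; unmatched rows stay NA.
df[df, on = .(user, match_row = row), spend_shifted := i.spend]
df[, c('row', 'match_row') := NULL]
df
```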
  • Thank you, @talat Same logic as the others. I suppose we did not get higher speed with this method because of the join; R still prefers taking a vector of index locations as input. Still, thank u :) – Sweepy Dodo Nov 07 '21 at 02:15

Maybe this could help

> df[, spend_shifted := spend[replace(seq(.N) - shift_by, seq(.N) <= shift_by, NA)], user][]
   user spend shift_by spend_shifted
1:    a     1        1            NA
2:    a     2        1             1
3:    a     3        2             1
4:    b     4        1            NA
5:    b     5        1             4
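A step-by-step breakdown of the index arithmetic for group `a` (where `spend` is `1:3` and `shift_by` is `c(1, 1, 2)`):

```r
seq(3) - c(1, 1, 2)
# [1] 0 1 1    -- raw lagged positions; 0 points before the group start

replace(seq(3) - c(1, 1, 2), seq(3) <= c(1, 1, 2), NA)
# [1] NA  1  1 -- out-of-range positions become NA

(1:3)[c(NA, 1, 1)]
# [1] NA  1  1 -- subscripting with NA yields NA, giving the leading NAs
```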
  • Shouldn't it be `seq(.N) <= shift_by`? – s_baldur Nov 03 '21 at 08:27
  • @sindri_baldur yes, you are right, my bad. thanks for correcting me :) – ThomasIsCoding Nov 03 '21 at 08:28
  • @ThomasIsCoding, thank u! This is a good implementation. Same logic as some others, though very readable. Honestly, my fav was @r2evans' because his was the most readable to me and I liked the idea of using `;` for succinctness. Both ran in no time! – Sweepy Dodo Nov 07 '21 at 02:02

I have carried out a benchmark test, as scalability is very important for me. df is the same as the original, only repeated 10,000,000 times. Thus, 50,000,000 rows.

x <- 1e7
df <- data.table(user = rep(c('a', 'a', 'a', 'b', 'b'), x)
                 , spend = rep(1:5, x)
                 , shift_by = rep(c(1,1,2,1,1), x)
                 ); df

          user spend shift_by
       1:    a     1        1
       2:    a     2        1
       3:    a     3        2
       4:    b     4        1
       5:    b     5        1

benchmark:

 a <-
  microbenchmark(wimpel = {df[, newindex := rowid(user) - shift_by]
                            df[newindex < 0, newindex := 0]
                            df[newindex > 0, spend_shifted := df[, spend[newindex], by = .(user)]$V1]
                            }
                 , r2evans = {df[, spend_shifted := spend[{o <- seq_len(.N) - shift_by; o[o<1] <- NA; o; }], by = user]}
                 , sindri_1 = {df[, spend_shifted := data.frame(shift(spend, n = unique(sort(shift_by))))[cbind(1:.N, shift_by)], by = user]}
                 , sindri_2 = {df[, {rows <- 1:nrow(.SD) - shift_by; .SD[replace(rows, rows <= 0, NA), spend]}, by = user]}
                 , talat = {df[, row := .I, by = .(user)]
                             df[, match_row := row - shift_by]
                             df[df, on = .(user, match_row = row), x := i.spend]
                             df[, c('row', 'match_row') := NULL]
                            }
                 , thomas = {df[, spend_shifted := spend[replace(seq(.N) - shift_by, seq(.N) <= shift_by, NA)], user]}
                 , times = 20
                 )
autoplot(a)

benchmark violin plot

@ThomasIsCoding and @r2evans' methods are almost identical.

a[, .(mean = mean(time)), expr][order(mean)]

       expr       mean
1:   thomas 1974759530
2:  r2evans 2121604845
3: sindri_2 2530492745
4:   wimpel 4337907900
5: sindri_1 4585692780
6:    talat 7252938170

I am still in the process of parsing the logic of all the methods provided. I cannot thank you all enough for your contributions (of which there are many). I shall be voting for an answer in due course.
