How to apply a custom recursive function with data.table and loop over each index group-wise?

Question

Since I can't find an answer in below questions:

Apply a recursive function over groups and rows without explicit for loop

How do I mimic the drag functionality for new rows such as in Excel but for R?

I'll try in asking a new question related to above. I.e, I want to apply a custom function recursively based on the output of previous values to the current row by group.

Example with a dataframe and a for loop:

for(i in 2:nrow(df1)) df1$z[i] <- df1$z[i-1] + df1$x[i-1] - df1$y[i-1]

Example with a dataframe and a for loop with custom function:

for(i in 2:nrow(df1)) df1$z[i] <- ifelse(df1$z[i-1] == df1$z[i],
                                          df1$z[i-1] + df1$x[i-1] - df1$y[i-1],
                                          df1$z[i-1] - df1$x[i-1] - df1$y[i-1])

However, with >1 mill rows, data.frames and for-loops are not optimal.

Is there any way to do above with data.table or dtplyr and optimized but also group-wise?

EDIT: See visualization of question.

It should first initiate from 2nd row like in for(i in 2:nrow(df) and it should use the custom function if and only if group[i]==group[i-1]

`for-loops are not optimal`... based on what? Most of R uses `for` loops deep down inside, and the premise that *"`*apply` is faster than `for`"* is now a ***myth*** since it was fixed years ago. The biggest flaw I see in your loop is that you are handling groups inefficiently (and in a fragile way). — r2evans, Feb 24 '22 at 17:20
Dataframes *and* for-loops are not optimal. Based on that. Please help with the group handling then. — geometricfreedom, Feb 25 '22 at 07:40

langtang · Accepted Answer · 2022-02-25T12:55:54.540

3

Does this use of Reduce do the trick?

tmp = data.table(
  grp = c(rep(0,6), rep(1,6)),
  x=c(10,20,30,40,50,60,1,2,3,4,5,6),
  y=c(1,2,3,4,5,6, 10,20,30,40,50,60)
)
tmp[, z:=Reduce(f=function(z,i) z + x[i-1] - y[i-1],
                x=(1:.N)[-1],
                init=0,
                accumulate = T)
    ,by=grp
]

Output:

    grp  x  y    z
 1:   0 10  1    0
 2:   0 20  2    9
 3:   0 30  3   27
 4:   0 40  4   54
 5:   0 50  5   90
 6:   0 60  6  135
 7:   1  1 10    0
 8:   1  2 20   -9
 9:   1  3 30  -27
10:   1  4 40  -54
11:   1  5 50  -90
12:   1  6 60 -135

Take for example, row 4. The value in the z column is 54, which is equal to the prior row's z-value + prior row's x-value, minus prior row's y-value.

The function f within Reduce can take any complicated form, including ifelse statements. Here is an example, where I've made a function called func, which is a wrapper around Reduce. Notice that within the Reduce statement, f is a function taking prev (thanks to suggestion by @r2evans), and this function first calculates previous row's s value minus previous row's t value (this is akin to your x[-1]-y[-1]. Then there is an ifelse statement. If the difference between the prior rows s and t value (i.e. k) is >20, then the new value in this row will be the previous z value minus the product of 20-4k (i.e. prev-(20-4k)), otherwise it will the previous z value + k (i.e. which is equal to your original formulation: z[i-1]+x[i-1]-y[i-1])

func <- function(s,t) {
  Reduce(
    f=function(prev,i) {
      k=s[i-1] - t[i-1]
      ifelse(k>10, prev -(20-4*k), prev+k)
    },
    x=2:length(s),
    init=0,
    accumulate = TRUE
  )
}

You can then assign the value of the func(x,y) to z, like this:

tmp[, z:=func(x,y), by=.(grp)][]

Output:

    grp  x  y    z
 1:   0 10  1    0
 2:   0 20  2    9
 3:   0 30  3   61
 4:   0 40  4  149
 5:   0 50  5  273
 6:   0 60  6  433
 7:   1  1 10    0
 8:   1  2 20   -9
 9:   1  3 30  -27
10:   1  4 40  -54
11:   1  5 50  -90
12:   1  6 60 -135

edited Feb 25 '22 at 12:55

answered Feb 24 '22 at 13:58

langtang

22,248
1
12
27

Thanks for your approach. No, it should refer to the i-1 value of z as well. – geometricfreedom Feb 24 '22 at 15:00
See edited version for visualization. – geometricfreedom Feb 24 '22 at 15:45
Do you have some example data? If you look at my result above, you will see that the `z` value for row `i` equals `z[i-1] + x[i-1] - y[i+1]`. For example, row 11 `z` value is -90. This is equal to `z[10] + x[10] - y[10]`, or `-54+4-40`. My apologies if I am missing something. – langtang Feb 24 '22 at 17:05
@geometricfreedom, I think you're missing the elegance and correctness of langtang's code. It *does* use the previous value of `z`, though it may not be clear since the `z` in that anon-func actually refers to the `z` passed to it, not the z present in the row; the `z` passed to the anonfunc is actually the result of the previous iteration of the `Reduce` processing, which is the assignment to the previous row's `z` value ... so it *is* the previous value. Lacking a more reproducible question with sample data disproving this solution, the ball is in your court, I think langtang has it right. – r2evans Feb 24 '22 at 17:32
1

If it makes it any more clear (since `Reduce` can be rather difficult to grok), I often code it using `prev` as the anonfunc first arg, which would make this `Reduce(f = function(prev, i) prev + x[i] - y[i], ...)`. – r2evans Feb 24 '22 at 17:34
1

@r2evans I'll use that moving forward (i.e. naming `prev`.. ) that's very helpful reminder. – langtang Feb 24 '22 at 17:41
@r2evans, I'm struggling to move the function in `f` outside `Reduce`.. Do you know how to do that? – langtang Feb 24 '22 at 18:08
I don't understand what you're trying to do: `Reduce` is a fundamental operation here. – r2evans Feb 24 '22 at 18:14
@r2evans, sorry, maybe I should open new question. Here, if I define a function like this `func = function(a,b) (a*b+b/a+27-a^3)/pi`, I can then refer that `func` in Reduce like this `Reduce(func, c(1:4), accumulate = T)`, and it works, but I was unsure how do the parallel with the problem above (i.e. the usage of Reduce for the OP). Obviously, I don't need to do this, because the f is simple. But I was curious how to define it outside Reduce, like I did with `func` – langtang Feb 24 '22 at 18:52
1

I see ... because this function refers to data that *should* be found in the function's call tree but may not ... yeah, it's really not that simple when reducing multiple columns, unfortunately. I don't know of a generic way that would fit "well". – r2evans Feb 24 '22 at 18:56
That is, because `Reduce` only reduces a simple list/vector and not a set of lists/vectors (i.e., multiple columns), it is not generalizable in that way. Is that what you are trying to do? – r2evans Feb 24 '22 at 18:56
(I've actually pondered that very question ... the closest I came is https://stackoverflow.com/a/70945868/3358272 ...) – r2evans Feb 24 '22 at 18:58
@r2evans helpful as always.. thanks – langtang Feb 24 '22 at 23:25
Thanks langtang. Probably I don't understand your code that well that is why. How do you fit in an ifelse() statement in there? Is it different from data.table::set()? – geometricfreedom Feb 25 '22 at 07:40
Not solved, but will mark as most helpful. It helped me discover other underlying functions. – geometricfreedom Feb 25 '22 at 07:52
re data.table::set(), no, you cannot use set() with by. re ifelse(), sure.. You can, but your example didn't require it.. , the by="grp", takes care of insuring that the formula previous z + previous x - previous y is NOT evaluate for the first row in the group.. .. Your function f can, however, be any complicated function of the values in the prior rows, including if else statements.. (see my edit) – langtang Feb 25 '22 at 12:43
@r2evans, I've resorted to using a wrapper around `Reduce`, which doesn't completely resolve my original query to you, but improves readability (i.e. moves the Reduce "out" of the `tmp[i,j,by]` – langtang Feb 25 '22 at 13:00

How to apply a custom recursive function with data.table and loop over each index group-wise?

1 Answers1