2

I am somewhat disturbed by the lead() and lag() which is not dragged dynamically/sequentially for each filled row.

My wish is to fill a new row based on an initial value and then sequentially fill next rows based on previously filled rows. In Excel this could be done by stating the formula/function in the cell and just drag it. How do I do it in R?

See below example

       x     y     z
   <dbl> <dbl> <dbl>
 1     1     1     1
 2     2     3    NA
 3     3     5    NA
 4     4     7    NA
 5     5     9    NA
 6     6    11    NA
 7     7    13    NA
 8     8    15    NA
 9     9    17    NA
10    10    19    NA 

The desired output is following this calculation where t-1 is the subscript for previous value: Z = Z_t-1 + X_t-1 - Y_t-1.

Desired output

       x     y     z
   <dbl> <dbl> <dbl>
 1     1     1     1
 2     2     3     1
 3     3     5     0
 4     4     7    -2
 5     5     9    -6
 6     6    11    -12
 7     7    13    -18
 8     8    15    -25
 9     9    17    -33
10    10    19    -42 
  • 1
    Your desired output doesn't match your stated formula. For example, in row 5, the value of x in the previous row is 4 and the value of z in the previous row is -2. These sum to 2, so when you subtract 7 (the value of y in the previous row) you should get -5, not -6. – Allan Cameron Oct 11 '21 at 16:29

2 Answers2

3

Note that the desired result in your question does not match the output of the formula you describe.

In this specific case, you can get your answer by using cumsum (cumulative sum):

cumsum(df1$x - df1$y + 1)
 [1]   1   1   0  -2  -5  -9 -14 -20 -27 -35

However, in general, if you wish to apply a function recursively based on the output of the previous result, you need to write a loop to do it (or use a function that uses a loop "under the hood", as akrun shows with accumulate). A simple loop in base R that achieves your result would be:

for(i in 2:nrow(df1)) df1$z[i] <- df1$z[i-1] + df1$x[i-1] - df1$y[i-1] 

df
#>     x  y   z
#> 1   1  1   1
#> 2   2  3   1
#> 3   3  5   0
#> 4   4  7  -2
#> 5   5  9  -5
#> 6   6 11  -9
#> 7   7 13 -14
#> 8   8 15 -20
#> 9   9 17 -27
#> 10 10 19 -35

When you can use a function like cumsum that uses C-based vectorization rather than loops within R, it is likely to be a more efficient solution.

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • Thanks, your forloop is what solves it even for larger datasets but it is computationally inefficient. Using accumulate does not generalize on to other datasets with modification of the code. Do you know why? E.g. ~1.2mill rows with 16 variables and grouped data. – geometricfreedom Oct 11 '21 at 20:12
2

As in the comments, the expected output is not matching based on the formula specified. In tidyverse, the recursive operation can be done with accumulate

library(purrr)
library(dplyr)
df1 %>% 
   mutate(z = unlist(accumulate2(x, y, ~ ..1 + ..2 - ..3, 
       .init = first(z))[-(n()+1)]))

-output

   x  y   z
1   1  1   1
2   2  3   1
3   3  5   0
4   4  7  -2
5   5  9  -5
6   6 11  -9
7   7 13 -14
8   8 15 -20
9   9 17 -27
10 10 19 -35

data

df1 <- structure(list(x = 1:10, y = c(1L, 3L, 5L, 7L, 9L, 11L, 13L, 
15L, 17L, 19L), z = c(1L, NA, NA, NA, NA, NA, NA, NA, NA, NA)), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks, I can replicate your code on the specific dataset. But why cant I generalize it on other datasets with modification? – geometricfreedom Oct 11 '21 at 20:10
  • @geometricfreedom depends on what function you are using and also please check if the columns are of class `numeric/integer` – akrun Oct 11 '21 at 20:13
  • @geometricfreedom you said modification of code. I am guessing that your modification may have an impact i.e. without knowing what you changed, it is not sure – akrun Oct 11 '21 at 20:14
  • Assume there are groups in the data and more columns, which creates this code (all numeric except group column which is integer): df %>% group_by(group_col) %>% mutate(z = unlist(accumulate2(x,y,v,u, ~ ..1 + ..2 - ..3 - ..4 - ..5, .init = first(z))[-n()+1)])) – geometricfreedom Oct 11 '21 at 20:23
  • @geometricfreedom `accumulate2` can take only 2 columns and not more than 2 – akrun Oct 11 '21 at 20:23
  • Is there a solution for more than 2? – geometricfreedom Oct 11 '21 at 20:24
  • You may need `for` loop for that – akrun Oct 11 '21 at 20:24
  • @geometricfreedom or another option is `pmap`, but I think `for` s a better option – akrun Oct 11 '21 at 21:53