2

I have to produce a self-referencing variable (ind) that is grouped by an id and has to fulfill a certain condition (e.g., time >1). Here is a toy example:

set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 4), time = rep(1:4, 2), ret = rnorm(8)/100)
dt$ind <- if_else(dt$time == 1, 100, as.numeric(NA))
dt

dt <- dt %>%
  group_by(id) %>%
  mutate(
    ind = if_else(time > 1, lag(ind, 1)*(1+ret), ind)
  )

This is the output:

Values for ind missing

Obviously I cannot use mutate in this set up since it is referencing to the initial values of ind and does not update when new values are calculated.

I would like to avoid running a loop. Any ideas how I can compute ind for all time periods most efficiently?


Edit:

Thanks to everyone for the helpful answers! I have a slightly trickier extension of the above issue.

How can I deal with higher lags? E.g., with lag = 2, such that

index_{t} = index_{t-2}*(1+ret_{t})

Here is a sample data frame and a sample outcome that I produced with Excel:

set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 5), time = rep(1:5, 2), ret = rnorm(10)/100)
dt$ind <- if_else(dt$time == 1, 120, if_else(dt$time == 2, 125, as.numeric(NA)))

enter image description here

Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
glaucon
  • 190
  • 1
  • 9
  • 3
    Is this what you want to calculate? `ind_1 = 100, ind_i = ind_{i-1} * (1 + ret_i)`? Then use `cumprod(c(1, (1 + dt$ret[-1]))) * 100`, – Roland Jul 27 '21 at 14:04
  • Thank you for the quick reply. Yours is a very tailor-made solution. It works for this example but I would like to have a more general solution. 1. I would like to use in the code the lagged value of ind and not ret. 2. ind does not necessarily has to start with 100. Eg, it could have the following values: ind =110 for id=a and time =1 and ind =150 for id =b and time =1. – glaucon Jul 27 '21 at 14:30
  • 1
    Obviously, you'd need to adjust to your specific needs. It is not generally possible to calculate recursive sequences/series efficiently in R. If an explicit solution doesn't exist, you need to switch to another programming language to do this efficiently. Typically, you'd use Rcpp, which makes this easy. – Roland Jul 27 '21 at 14:33

3 Answers3

2

Update Two I asked a question as your new requirements made for an interesting case and I thought it would be a great opportunity to grapple with the issue to learn new stuff. Hopefully Mr. Grothendieck taught us an ingenious way of solving it. Let me break this to you first:

  • For this case we make use of complex number structure (a + bi). As you may already know a is the real part and b is the imaginary part while i is an indeterminate. So we restructure our ind output in a way that the previous value of ind is the real part a and the second to last value is the imaginary part b for example our first value can be restructured as 120 + 0i and the second as 125 + 120i
  • We are doing this because we need to keep both value in the previous iteration so that we could extract the one we need. We use Re function to extract the real part and Im to extract the imaginary part
  • With regard to ret variable we only need to omit first 2 rows and use the rest corresponding to every iteration
  • In the end we write our custom function in a way that we extract imaginary part from previous ind which is in fact our two previous value and distribute it to (1 + current value of ret) and there comes the subtle point: in order to keep the same structure for next iterations we also add the real part of previous iteration as the imaginary part of current value (which is in fact the real part of next iteration) & finally we only extract real parts

I know there might be too much to get here but just let me know if there is anything I could explain more for and and thank you for this great question.

library(dplyr)
library(purrr)

dt %>%
  group_by(id) %>%
  mutate(ind = c(ind[1], 
                 Re(unlist(accumulate(ret[3:n()], .init = ind[2] + ind[1] * 1i,
                                      ~ Im(..1) * (1 + ..2) + Re(..1) * 1i)))))

# A tibble: 10 x 4
# Groups:   id [2]
   id     time      ret   ind
   <chr> <int>    <dbl> <dbl>
 1 a         1  0.00554  120 
 2 a         2 -0.00280  125 
 3 a         3  0.0178   122.
 4 a         4  0.00187  125.
 5 a         5  0.0114   124.
 6 b         1  0.00416  120 
 7 b         2  0.0123   125 
 8 b         3  0.00237  120.
 9 b         4 -0.00365  125.
10 b         5  0.0111   122.

Update One based on @AnilGoyal's brilliant idea

library(dplyr)
library(purrr)

dt %>%
  group_by(id) %>%
  group_by(d = seq(n()) %% 2, .add = TRUE) %>%
  mutate(ind = accumulate(ret[-1], .init = ind[1], ~ (..2 + 1) * ..1)) %>%
  select(-d)

# A tibble: 10 x 5
# Groups:   id, d [4]
       d id     time      ret   ind
   <dbl> <chr> <int>    <dbl> <dbl>
 1     1 a         1  0.00554  120 
 2     0 a         2 -0.00280  125 
 3     1 a         3  0.0178   122.
 4     0 a         4  0.00187  125.
 5     1 a         5  0.0114   124.
 6     0 b         1  0.00416  120 
 7     1 b         2  0.0123   125 
 8     0 b         3  0.00237  120.
 9     1 b         4 -0.00365  125.
10     0 b         5  0.0111   122.

Or in base R we could do this:

do.call(rbind, lapply(split(dt, dt$id), function(x) {
  x$ind <- c(x$ind[1], Re(Reduce(function(a, b) Im(a) * (1 + b) + Re(a) * 1i,
                                 init = x$ind[2] + x$ind[1] * 1i, 
                                 x$ret[3:nrow(x)], accumulate = TRUE)))
  x
}))

     id time          ret      ind
a.1   a    1  0.005543269 120.0000
a.2   a    2 -0.002802719 125.0000
a.3   a    3  0.017751634 122.1302
a.4   a    4  0.001873201 125.2342
a.5   a    5  0.011425261 123.5256
b.6   b    1  0.004155261 120.0000
b.7   b    2  0.012295066 125.0000
b.8   b    3  0.002366797 120.2840
b.9   b    4 -0.003653828 124.5433
b.10  b    5  0.011051443 121.6133
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • How do I have to adjust your first code (using purrr) to accommodate higher lags? E.g. lag = 2: ind_{t} = ind_{t-2}*(1+ret_{t}). See also my edit of the main question. – glaucon Jul 28 '21 at 08:35
  • 1
    As a workaround, you may refer to my edited answer, which you can easily adjust for purrr too. @glaucon – AnilGoyal Jul 28 '21 at 15:30
  • 1
    @glaucon I think Anil's solution is by far and away the best one, I also added it to my solution in the meantime I see whether I could find an alternative one. – Anoushiravan R Jul 28 '21 at 17:12
  • @glaucon I made some updates please check them. – Anoushiravan R Jul 28 '21 at 22:57
2

As a workaround, you can use following trick in edited circumstances. Note you may change this for any number of simultaneous series

  • I just added an extra group_by statement based on a modulo sequence of required number of variables using seq(n()) %% 2
set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 5), time = rep(1:5, 2), ret = rnorm(10)/100)
dt$ind <- ifelse(dt$time == 1, 120, ifelse(dt$time == 2, 125, as.numeric(NA)))
library(dplyr, warn.conflicts = F)

dt %>% group_by(id) %>%
  group_by(d = seq(n()) %% 2, .add = TRUE) %>%
  mutate(ind = cumprod(1 + duplicated(id) * ret)* ind[1])
#> # A tibble: 10 x 5
#> # Groups:   id, d [4]
#>    id     time      ret   ind     d
#>    <chr> <int>    <dbl> <dbl> <dbl>
#>  1 a         1  0.00554  120      1
#>  2 a         2 -0.00280  125      0
#>  3 a         3  0.0178   122.     1
#>  4 a         4  0.00187  125.     0
#>  5 a         5  0.0114   124.     1
#>  6 b         1  0.00416  120      0
#>  7 b         2  0.0123   125      1
#>  8 b         3  0.00237  120.     0
#>  9 b         4 -0.00365  125.     1
#> 10 b         5  0.0111   122.     0

OLD answer: Without using purrr

library(tidyverse)

set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 4), time = rep(1:4, 2), ret = rnorm(8)/100)
dt$ind <- if_else(dt$time == 1, 100, as.numeric(NA))
dt
#>   id time          ret ind
#> 1  a    1  0.005543269 100
#> 2  a    2 -0.002802719  NA
#> 3  a    3  0.017751634  NA
#> 4  a    4  0.001873201  NA
#> 5  b    1  0.011425261 100
#> 6  b    2  0.004155261  NA
#> 7  b    3  0.012295066  NA
#> 8  b    4  0.002366797  NA

dt %>% group_by(id) %>%
  mutate(ind = cumprod(1 + duplicated(id) * ret)* ind[1])
#> # A tibble: 8 x 4
#> # Groups:   id [2]
#>   id     time      ret   ind
#>   <chr> <int>    <dbl> <dbl>
#> 1 a         1  0.00554 100  
#> 2 a         2 -0.00280  99.7
#> 3 a         3  0.0178  101. 
#> 4 a         4  0.00187 102. 
#> 5 b         1  0.0114  100  
#> 6 b         2  0.00416 100. 
#> 7 b         3  0.0123  102. 
#> 8 b         4  0.00237 102.

Created on 2021-07-27 by the reprex package (v2.0.0)

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
1

Anoushiravan's suggestion solved my issue. Here is my final code that accommodates all my requirements: (i) group by id, (ii) condition on time (here, time>=2), (iii) starting value other than 100 (here, ind = 150):

library(dplyr)
library(purrr)


set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 4), time = rep(1:4, 2), ret = rnorm(8)/100)
dt$ind <- if_else(dt$time == 2, 150, as.numeric(NA))
dt

dt_tmp <- dt %>%
  group_by(id) %>%
  filter(time>=2) %>%
  mutate(
    ind =  accumulate(ret[-1], .init = ind[1], ~ (..2 + 1) * ..1)
         )

dt_tmp <- dt_tmp %>% select(id, time, ind)

dt <- dt %>% left_join(dt_tmp, by = c("id", "time"))
dt <- rename(dt, ind_orig = ind.x)
dt <- rename(dt, ind = ind.y)
rm(dt_tmp) 

Final output:

enter image description here


Edit:

In the end, I solved my issue with higher lags using a loop (which I initially wanted to avoid):

my_projection <- function(index, ret, lag) {
  if (length(index) != length(ret)) {
    print("error: length of vectors does not match")
    break;
  }
  if (lag < 0) {
    print("error: lag < 0")
    break;
  }
  else {
    for(i in 1:length(index)){
      if (i<=lag){
        print(index[i])
      }
      else {
        print(index[i-lag]*(1+ret[i]))
        index[i] = index[i-lag]*(1+ret[i])
    }
  }
  }
  return(index)
}


dt <- dt %>% group_by(id) %>%
  mutate(ind = my_projection(ind, ret, 2))

The output:

enter image description here

glaucon
  • 190
  • 1
  • 9