
I have a time-series panel dataset that is structured in the following way: there are 2 funds, each of which owns different stocks in each time period.

df <- data.frame(
  fund_id  = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  time_Q   = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 1, 1, 2, 2),
  stock_id = c("A", "B", "C", "A", "C", "D", "E", "D", "E", "A", "B", "B", "C")
)

> df
   fund_id time_Q stock_id
1        1      1        A
2        1      1        B
3        1      1        C
4        1      2        A
5        1      2        C
6        1      2        D
7        1      2        E
8        1      3        D
9        1      3        E
10       2      1        A
11       2      1        B
12       2      2        B
13       2      2        C

For each fund, I would like to calculate the percentage of stocks held in the current time_Q that were also held at any point in the previous one or two quarters. So for every fund and every time_Q I would like two columns, past_1Q and past_1_2Q, showing what percentage of the stocks held in that quarter were also present in the previous quarter (past_1Q) or in either of the previous two quarters (past_1_2Q). Here is what the result should look like:

result <- data.frame(
  fund_id   = c(1, 1, 1, 2, 2),
  time_Q    = c(1, 2, 3, 1, 2),
  past_1Q   = c(NA, 0.5, 1, NA, 0.5),
  past_1_2Q = c(NA, 0.5, 1, NA, 0.5)
)

>  result
  fund_id time_Q past_1Q past_1_2Q
1       1      1      NA      NA
2       1      2     0.5     0.5
3       1      3       1       1
4       2      1      NA      NA
5       2      2     0.5     0.5
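
As a quick sanity check of the fund 1, time_Q 2 value above, here is a base R calculation using the df defined earlier (for illustration only):

# fund 1: stocks held in time_Q 2 vs. stocks held in time_Q 1
curr <- unique(df$stock_id[df$fund_id == 1 & df$time_Q == 2])  # "A" "C" "D" "E"
prev <- unique(df$stock_id[df$fund_id == 1 & df$time_Q == 1])  # "A" "B" "C"
length(intersect(curr, prev)) / length(curr)                   # 0.5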

I already asked a similar question here, but now I'm looking for common elements across any of the past lagged periods. I'm looking for a scalable dplyr or data.table solution that can handle around 12 past quarters and multiple funds, stocks, and time periods.

Thanks in advance!

Erwin Rhine

1 Answer


My solution:

library(data.table)

# dummy data
df <- data.table(fund_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
                 , time_Q = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 1, 1, 2, 2)
                 , stock_id = c("A", "B", "C", "A", "C", "D", "E", "D", "E", "A", "B", "B", "C")
                 ); df

# lower case col names
names(df) <- tolower(names(df))

# unique grouping
x <- df[, .(dummy =.N), .(fund_id, time_q)][, dummy := NULL]

# initialise empty table
y <- NULL

# loop
for(i in 1:nrow(x))
{
  # current quarter & before
  z <- df[fund_id == x[i, fund_id]
          & time_q %between% c( x[i, time_q] - 12, x[i, time_q])
          ]
  
  # current quarter
  a <- z[fund_id == x[i, fund_id]
         & time_q == x[i, time_q]
         , unique(stock_id)
         ]
  
  # minus 1 to minus 12 quarter (lapply)
  b <- lapply(1:12, \(j) z[fund_id == x[i, fund_id]
                           & time_q %between% c( x[i, time_q] - j, x[i, time_q] - 1)
                           , unique(stock_id)
                           ]
              )
  
  # results (named res to avoid shadowing base::c)
  res <- data.table(fund_id = x[i, fund_id]
                    , current_q = x[i, time_q]
                    )
  
  # no. of stocks in current quarter
  d <- length(a)
  
  # calculate % for the 12 periods
  res[, paste0('past_1_to_', 1:12, '_q') := lapply(1:12, \(j) length(intersect(a, b[[j]])) / d) ]
  
  # collect results
  y <- rbind(y, res)
}
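
One way to avoid hard-coding the 12-quarter window (see the comments below) is to keep the lag count in a single variable. A minimal sketch of that variation, assuming the df and x objects built above and a hypothetical n_lag parameter:

n_lag <- 12  # look-back window in quarters; change in one place only

y <- NULL
for(i in 1:nrow(x))
{
  # stocks held by this fund in the current quarter
  a <- df[fund_id == x[i, fund_id] & time_q == x[i, time_q], unique(stock_id)]
  
  # stocks held in the past 1 to j quarters, for j = 1..n_lag
  b <- lapply(1:n_lag, \(j) df[fund_id == x[i, fund_id]
                               & time_q %between% c(x[i, time_q] - j, x[i, time_q] - 1)
                               , unique(stock_id)]
              )
  
  # one result row per fund/quarter, with one overlap column per lag window
  res <- data.table(fund_id = x[i, fund_id], current_q = x[i, time_q])
  res[, paste0('past_1_to_', 1:n_lag, '_q') := lapply(1:n_lag, \(j) length(intersect(a, b[[j]])) / length(a))]
  
  y <- rbind(y, res)
}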

Benchmark

x <- 1e3
df <- data.table(fund_id = rep(1:x, each = x/10)
                 , time_Q = rep(1:4, each = x/4)
                 , stock_id = sample(letters[1:26], size = 20, replace = TRUE)
                 )

This took about 20 seconds on the above df with 100k rows and 1,200 (fund_id, time_q) groups.

Sweepy Dodo
  • Thank you very much for your response. If I would like to apply this to my actual dataset, where I'm looking at the past 1-12 periods, how would I go about modifying your code? Do I need to create, e.g., a minus 1 to minus 12 quarters part? – Erwin Rhine Feb 03 '22 at 21:04
  • 1
    np. tbh, given a 12 month period, I would not suggest hard coding certain block 12 times (bad programming practice). It would require significant change. Idea is to pass on a vector `1:12` to an apply function which subsets df `z`, get distinct `stock_id` for each period. Then pass on this list to df `d` where it will automatically create 12 columns. – Sweepy Dodo Feb 03 '22 at 21:21
  • Yeah, that's why I was hoping for a solution that would be scalable, similar to the response I have linked in the question. Also, when I try to run your solution on my actual data, even for the 1-2 past periods, I get the following error: `Error: non-numeric argument to binary operator`. In my actual data, time_q is a character in the quarterly date format 2011-03. – Erwin Rhine Feb 03 '22 at 21:32
  • Also, tbh, I don't really care about the individual periods between 1 and 12; as long as I have the percentage of matches with respect to the past 1-12 periods, that's fine. So one column would do in that case. – Erwin Rhine Feb 03 '22 at 21:33
  • 1
    @ErwinRhine Hi, I just automated it through the 12 periods. Give it a go Tho u'd have to tweak the script to calculate date diff. Try `lubridate` – Sweepy Dodo Feb 03 '22 at 22:28
  • Hi again! This is a shot in the dark, but my previous solution to this problem no longer works. While implementing your solution, I was not able to do it with dates instead of numbers for time_q. My dates look like this: 2012-3, 2012-06, etc. – Erwin Rhine Aug 19 '22 at 22:44
  • Sorry for the late reply. Do you see this line `time_q %between% c( x[i, time_q] - j, x[i, time_q] - 1)`? It calculates the difference in quarters. However, you will need to turn e.g. `2012-3` into a date (e.g. `2012-03-01`) for the above to work. After that you will need to change the above to use functions such as `difftime`, or look up the `lubridate` library, as it's great for date data (see the conversion sketch below). If all else fails, start a new question with a link to this question embedded to give context. – Sweepy Dodo Aug 24 '22 at 19:57
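
For quarter strings like 2012-3 or 2012-06, one option is to convert them to a running integer quarter index before running the loop, so the %between% arithmetic keeps working. A minimal base R sketch using a hypothetical helper to_q_index (my naming, not from the answer above):

# convert "YYYY-M" / "YYYY-MM" strings into a sequential quarter index
# (hypothetical helper, for illustration only)
to_q_index <- function(q) {
  parts <- strsplit(q, "-", fixed = TRUE)
  vapply(parts, function(p) {
    yr  <- as.integer(p[1])
    qtr <- (as.integer(p[2]) - 1L) %/% 3L + 1L  # months 1-3 -> Q1, 4-6 -> Q2, ...
    yr * 4L + qtr
  }, integer(1))
}

to_q_index(c("2012-3", "2012-06", "2012-12"))  # 8049 8050 8052: one unit per quarter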