Calculating the percentage of matching observation from one period to another in a panel data

Question

I have a time-series panel dataset that is structured in the following way: There are multiple funds that each own multiple stocks and we have a value column for the stock. As you can see the panel is not balanced. My actual dataset is very large with each fund having at least 500 stocks and different quarters being represented with some having missing quarter values.

df <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2011-12","2012-03","2012-06","2012-09",
           "2012-12","2013-03","2013-06","2014-09","2015-03","2013-03","2013-06","2013-09","2013-12"),
  value = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)


> df
   fund_id stock_id  year_q value
1        1        1 2011-03     1
2        1        1 2011-06     2
3        1        1 2011-09     1
4        1        1 2011-12     3
5        1        1 2012-03     4
6        1        1 2012-06     2
7        1        2 2011-12     1
8        1        2 2012-03     2
9        1        2 2012-06     3
10       1        2 2012-09     4
11       1        2 2012-12     2
12       1        2 2013-03     1
13       1        2 2013-06     3
14       2        1 2014-09     1
15       2        1 2015-03     1
16       2        3 2013-03     3
17       2        3 2013-06     2
18       2        3 2013-09     3
19       2        3 2013-12     1

I would like to calculate for each fund, the percentage of stocks held in the current quarter that were ever held in the previous one to 3 quarters. So basically for every fund and every date, I would like to have 3 columns with past 1 Q, past 2Q and past 3Q which show what percentage of stocks held on that date were also present in each of those past quarters.
Here is what the result should look like:

result <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
               "2013-03","2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03"),
  past_1Q = c("NA",1,1,0.5,1,1,0.5,1,1,1,"NA",1,1,1,"NA","NA","NA","NA","NA"),
  past_2Q = c("NA","NA",1,0.5,0.5,1,0.5,0.5,1,1,"NA","NA",1,1,"NA","NA","NA","NA","NA"),
  past_3Q = c("NA","NA","NA",0.5,0.5,0.5,0.5,0.5,0.5,1,"NA","NA","NA",1,"NA","NA","NA","NA","NA")
)

> result
   fund_id  year_q past_1Q past_2Q past_3Q
1        1 2011-03      NA      NA      NA
2        1 2011-06       1      NA      NA
3        1 2011-09       1       1      NA
4        1 2011-12     0.5     0.5     0.5
5        1 2012-03       1     0.5     0.5
6        1 2012-06       1       1     0.5
7        1 2012-09     0.5     0.5     0.5
8        1 2012-12       1     0.5     0.5
9        1 2013-03       1       1     0.5
10       1 2013-06       1       1       1
11       2 2013-03      NA      NA      NA
12       2 2013-06       1      NA      NA
13       2 2013-09       1       1      NA
14       2 2013-12       1       1       1
15       2 2014-03      NA      NA      NA
16       2 2014-06      NA      NA      NA
17       2 2014-09      NA      NA      NA
18       2 2014-12      NA      NA      NA
19       2 2015-03      NA      NA      NA

I tried to do this using rollapply but can't get the correct results. I understand that this might not be the best sample data but in my real data each fund usually have more than 500 stocks and I expect the percentage of matching stocks from one period to the past periods to be something around 0.95 on average.

This is what I have to get the first two result columns (credits to @r2evans):

result <- df %>%
  group_by(fund_id) %>%
  mutate(miny = min(year_q), maxy = max(year_q)) %>%
  distinct(fund_id, miny, maxy) %>%
  group_by(fund_id) %>%
  mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
  transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m")))  %>%
  tidyr::unnest(year_q) %>%
  full_join(df, by = c("fund_id", "year_q")) %>%
  distinct(fund_id, year_q) %>%
  arrange(fund_id, year_q)

For your previous question, there was code that filled in the quarter's gaps. Is there a reason you're not using that fix? — r2evans, Jan 26 '22 at 01:12
Hi, I am. I even wrote it down and gave the credits towards the bottom. But I'm not sure how to go about calculating the percentages of matching stocks for each quarter. — Erwin Rhine, Jan 26 '22 at 18:31
I wasn't really asking for credits, thanks. I'm wondering why your sample data still has those gaps. — r2evans, Jan 26 '22 at 18:37
Could you explain why `2012-09` for `id=1` is `0.5, 0.5, 0.5` instead of `1, 1, 1`? Other rows of the expected output also look strange. — Darren Tsai, Aug 24 '22 at 13:32

phiver · Answer 1 · 2022-08-24T13:45:07.920

Given your example, I think the code below gets you there. You might need to switch to data.table if you have a lot of records. Note that I use df1, not df. df is a function in R. I used padr::pad to fill in the missing quarters within a fund. So it will only fill in quarters if there is data from at least 1 stock in the fund. It will not add quarters that are in fund 2 to fund 1 as these have nothing to do with fund 1.

edit: Added a group by in the lag function to correctly lag over the stock_id as the arrange puts the NA values for the stock_ids, instead of the desired order.

df1 %>% 
  mutate(year_q = ymd(paste0(year_q, "-01"))) %>% 
  group_by(fund_id) %>% 
  padr::pad(interval = "3 months") %>% 
  arrange(fund_id, stock_id, year_q) %>% 
  mutate(past_1Q = if_else(stock_id == lag(stock_id, order_by = year_q, default = 0), 1, 0),
         past_2Q = if_else(stock_id == lag(stock_id, n = 2, order_by = year_q, default = 0), 1, 0),
         past_3Q = if_else(stock_id == lag(stock_id, n = 3, order_by = year_q, default = 0), 1, 0)) %>% 
  group_by(year_q, .add = TRUE) %>% 
  # add number of stocks in the fund in this quarter
  mutate(n_stocks = n()) %>% 
  summarise(past_1Q = sum(past_1Q, na.rm = T) / mean(n_stocks, na.rm = T),
            past_2Q = sum(past_2Q, na.rm = T) / mean(n_stocks, na.rm = T),
            past_3Q = sum(past_3Q, na.rm = T) / mean(n_stocks, na.rm = T))

# A tibble: 19 × 5
# Groups:   fund_id [2]
   fund_id year_q     past_1Q past_2Q past_3Q
     <dbl> <date>       <dbl>   <dbl>   <dbl>
 1       1 2011-03-01     0       0       0  
 2       1 2011-06-01     1       0       0  
 3       1 2011-09-01     1       1       0  
 4       1 2011-12-01     0.5     0.5     0.5
 5       1 2012-03-01     0       1       0.5
 6       1 2012-06-01     0       1       0  
 7       1 2012-09-01     1       0       1  
 8       1 2012-12-01     1       1       0  
 9       1 2013-03-01     1       1       1  
10       1 2013-06-01     1       1       1  
11       2 2013-03-01     0       0       0  
12       2 2013-06-01     1       0       0  
13       2 2013-09-01     1       1       0  
14       2 2013-12-01     1       1       1  
15       2 2014-03-01     0       0       0  
16       2 2014-06-01     0       0       0  
17       2 2014-09-01     0       0       0  
18       2 2014-12-01     0       0       0  
19       2 2015-03-01     0       1       0

I think your results table is incorrect. Looking at 2012-12, there is only one stock live in fund 1. Based on this calculation the outcome should be 100%, not 50%. As 100% of the stocks in the fund now, where also in the fund last quarter, etc. etc.

Nice solution! But why does the last row (`2015-03` for id=2) be 1, 0, 0? It should be 0, 1, 0 because its previous quarter (`2014-12`) has no stock record, right? — Darren Tsai, Aug 24 '22 at 13:25
@DarrenTsai, See edit. The lag needed to be ordered extra to account for NA stock_ids. — phiver, Aug 24 '22 at 13:45

Darren Tsai · Accepted Answer · 2022-08-24T13:26:48.897

library(tidyverse)

df %>%
  mutate(year_q = as.Date(paste0(year_q, '-01'))) %>%
  group_by(fund_id, year_q) %>%
  summarise(stock_id = list(unique(stock_id))) %>%
  complete(year_q = seq(min(year_q), max(year_q), by = "3 months")) %>%
  reduce(.init = ., 1:3, ~ mutate(.x, "past_{.y}Q" := map(1:n(), \(N) unlist(stock_id[pmax(N-.y, 0)])))) %>%
  mutate(across(contains("past"), \(past) map2_dbl(stock_id, past, ~ mean(.x %in% .y)) %>% replace_na(0))) %>%
  ungroup()

# A tibble: 19 × 6
   fund_id year_q     stock_id  past_1Q past_2Q past_3Q
     <dbl> <date>     <list>      <dbl>   <dbl>   <dbl>
 1       1 2011-03-01 <dbl [1]>     0       0       0  
 2       1 2011-06-01 <dbl [1]>     1       0       0  
 3       1 2011-09-01 <dbl [1]>     1       1       0  
 4       1 2011-12-01 <dbl [2]>     0.5     0.5     0.5
 5       1 2012-03-01 <dbl [2]>     1       0.5     0.5
 6       1 2012-06-01 <dbl [2]>     1       1       0.5
 7       1 2012-09-01 <dbl [1]>     1       1       1  
 8       1 2012-12-01 <dbl [1]>     1       1       1  
 9       1 2013-03-01 <dbl [1]>     1       1       1  
10       1 2013-06-01 <dbl [1]>     1       1       1  
11       2 2013-03-01 <dbl [1]>     0       0       0  
12       2 2013-06-01 <dbl [1]>     1       0       0  
13       2 2013-09-01 <dbl [1]>     1       1       0  
14       2 2013-12-01 <dbl [1]>     1       1       1  
15       2 2014-03-01 <NULL>        0       0       0  
16       2 2014-06-01 <NULL>        0       0       0  
17       2 2014-09-01 <dbl [1]>     0       0       0  
18       2 2014-12-01 <NULL>        0       0       0  
19       2 2015-03-01 <dbl [1]>     0       1       0

Calculating the percentage of matching observation from one period to another in a panel data

2 Answers2