0

I am trying to clean my stocks' df and I need to get rid of the ones that have less than 10 observations per month.

Already checked these 2 threads: subsetting-based-on-observations-in-a-month and ddply-for-sum-by-group-in-r

But I'm a noob and I cannot figure it out yet.

In short: Please, help me out eliminating IDs (Stocks) whose observations per month are <10 (for any month if possible). They are Id'd via the permanent number from CRSP (permno).

Here is the df: Lessthan10days.csv

Thank you so much,

Leo

Community
  • 1
  • 1

2 Answers2

2

We could create a column 'MonthYr' from the 'date' column after converting it to 'Date' class. Get the number of observations ('n') per group ('permno', 'MonthYr') and use that to remove the IDs ('permno') that have at least one 'n' less than 10.

library(dplyr)
res <- df1 %>%
        mutate(MonthYr=format(as.Date(date, format='%m/%d/%Y'), '%Y-%m')) %>%
        group_by(permno, MonthYr) %>%
        mutate(n=n()) %>% 
        group_by(permno) %>% 
        filter(all(n>=10))

 all(res$n>=10)
 #[1] TRUE
 tbl <-table(res$permno, res$MonthYr)
 all(tbl[tbl!=0]>=10)
 #[1] TRUE

Or using similar approach withdata.table

 library(data.table)
  setDT(df1)[,N:=.N , list(permno, MonthYr=format(as.Date(date, 
             format='%m/%d/%Y'), '%Y-%m'))][all(N>=10) , permno][]

data

df1 <- read.csv('Lessthan10days.csv', header=TRUE, stringsAsFactors=FALSE)
akrun
  • 874,273
  • 37
  • 540
  • 662
0

I'd just like to add that the next commands work partially:

library(dplyr)
res <- df1 %>%
        mutate(MonthYr=format(as.Date(date, format='%m/%d/%Y'), '%Y-%m')) %>%
        group_by(permno, MonthYr) %>%
        mutate(n=n()) %>% 
        group_by(permno) %>% 
        filter(all(n>=10))

 all(res$n>=10)
 #[1] TRUE
 tbl <-table(res$permno, res$MonthYr)
 all(tbl[tbl!=0]>=10)
 #[1] TRUE

They do not perfectly clean the sample, I believe that some NA values are counted as observations, so they might 'escape' the subsetting/cleaning.

Therefore I did it manually to be sure. A suggestion I can propose would be using just:

>tbl <-table(res$permno, res$MonthYr)
>write.csv(tbl,"tbl.csv")

And then you look into the spreadsheet yourself for cleaning obs<10 (for each year/stock). On top of that, you can filter the NA values for Price, and erase the 5-10 stocks (ids) that present a couple of months with <10 observations.

Hope this helps a bit. Thanks again for your help!