Subset my df so that each ID has at least 10 obs a month

Question

I am trying to clean my stocks df, and I need to get rid of the stocks that have fewer than 10 observations per month.

Already checked these 2 threads: subsetting-based-on-observations-in-a-month and ddply-for-sum-by-group-in-r

But I'm a noob and I cannot figure it out yet.

In short: please help me eliminate the IDs (stocks) that have fewer than 10 observations in a month (in any month, if possible). They are identified by their permanent number from CRSP (permno).

Here is the df: Lessthan10days.csv

Thank you so much,

Leo

Instead of a big dataset, it would have been better if you had shown a few lines of your data and the expected output based on that. — akrun, Mar 11 '15 at 13:49
Do you need to remove the IDs that have at least one month with fewer than 10 observations? — akrun, Mar 11 '15 at 13:59

akrun · Accepted Answer · 2015-03-11T14:26:57.950

2

We could create a column 'MonthYr' from the 'date' column after converting it to 'Date' class. Get the number of observations ('n') per group ('permno', 'MonthYr') and use that to remove the IDs ('permno') that have at least one 'n' less than 10.

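library(dplyr)
res <- df1 %>%
        # 'date' is read as text such as "03/11/2015"; build a year-month key from it
        mutate(MonthYr=format(as.Date(date, format='%m/%d/%Y'), '%Y-%m')) %>%
        # n = number of rows for each stock in each month
        group_by(permno, MonthYr) %>%
        mutate(n=n()) %>% 
        # keep a stock only if every one of its months has at least 10 rows
        group_by(permno) %>% 
        filter(all(n>=10))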

 all(res$n>=10)
 #[1] TRUE
 tbl <-table(res$permno, res$MonthYr)
 all(tbl[tbl!=0]>=10)
 #[1] TRUE
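To see the logic on something small, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical and are not taken from Lessthan10days.csv; only the column names `permno` and `date` follow the real data. Stock 1 has 12 rows in both months, while stock 2 has only 3 rows in February, so it should be dropped entirely.

```r
library(dplyr)

# Hypothetical toy data: permno 1 has 12 rows in each of two months,
# permno 2 has 12 rows in January but only 3 in February.
toy <- data.frame(
  permno = c(rep(1, 24), rep(2, 15)),
  date = c(
    sprintf("01/%02d/2015", 1:12), sprintf("02/%02d/2015", 1:12),
    sprintf("01/%02d/2015", 1:12), sprintf("02/%02d/2015", 1:3)
  ),
  stringsAsFactors = FALSE
)

toy_res <- toy %>%
  mutate(MonthYr = format(as.Date(date, format = "%m/%d/%Y"), "%Y-%m")) %>%
  group_by(permno, MonthYr) %>%
  mutate(n = n()) %>%          # rows per stock per month
  group_by(permno) %>%
  filter(all(n >= 10)) %>%     # drop a stock if any month is short
  ungroup()

unique(toy_res$permno)
#[1] 1
```

All of stock 2's rows are removed, including its complete January, because the filter is applied per `permno` rather than per month.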

Or, using a similar approach with data.table:

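 library(data.table)
 # Count rows per permno/month, then keep only the permnos whose every month has N >= 10
 setDT(df1)[, N := .N, by = .(permno, MonthYr = format(as.Date(date,
            format='%m/%d/%Y'), '%Y-%m'))]
 res <- df1[, if (all(N >= 10)) .SD, by = permno]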

data

df1 <- read.csv('Lessthan10days.csv', header=TRUE, stringsAsFactors=FALSE)

edited Mar 11 '15 at 14:26

answered Mar 11 '15 at 14:01

akrun


Neither setDT nor %>% is found in my R packages. What am I doing wrong? – Leo Del Mar Mar 11 '15 at 15:13
@LeoDelMar Have you installed `dplyr` or `data.table`? – akrun Mar 11 '15 at 15:14
and you loaded `library(dplyr)`, `library(data.table)` before running the code? If so, can you show the versions you have. – akrun Mar 11 '15 at 15:20
@LeoDelMar In some older versions, `%.%` may work, and instead of `setDT(df1)`, `as.data.table(df1)` – akrun Mar 11 '15 at 15:24
That's what I was missing: loading them. I had never seen that before, sorry. – Leo Del Mar Mar 11 '15 at 15:27
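For anyone hitting the same "could not find function" error, the steps discussed in these comments can be sketched roughly as follows. This assumes a working CRAN mirror is configured; the loop over package names is just one way to write it.

```r
# One-time installation from CRAN (skipped if the package is already installed)
for (pkg in c("dplyr", "data.table")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

# Needed in every new R session before %>% or setDT() can be found
library(dplyr)
library(data.table)

exists("setDT")   # TRUE once data.table is attached
exists("%>%")     # TRUE once dplyr is attached
```

Installing a package only puts it on disk; `library()` is what makes its functions visible in the current session.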

score 0 · Answer 2 · answered Mar 12 '15 at 10:32

I'd just like to add that the following commands only partially work:

library(dplyr)
res <- df1 %>%
        mutate(MonthYr=format(as.Date(date, format='%m/%d/%Y'), '%Y-%m')) %>%
        group_by(permno, MonthYr) %>%
        mutate(n=n()) %>% 
        group_by(permno) %>% 
        filter(all(n>=10))

 all(res$n>=10)
 #[1] TRUE
 tbl <-table(res$permno, res$MonthYr)
 all(tbl[tbl!=0]>=10)
 #[1] TRUE

They do not perfectly clean the sample. I believe that some rows with NA values are counted as observations, so those stocks might 'escape' the subsetting/cleaning.

Therefore I did it manually to be sure. One suggestion would be to use just:

tbl <- table(res$permno, res$MonthYr)
write.csv(tbl, "tbl.csv")

Then you look through the spreadsheet yourself for counts below 10 (for each month/stock). On top of that, you can filter out the NA values for Price and delete the 5-10 stocks (IDs) that have a couple of months with fewer than 10 observations.
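If missing prices really are what lets some stocks slip through, a possible alternative to the manual check is to count only the non-NA prices per month. This is a rough sketch on made-up data: the `toy` data frame and the column name `price` are assumptions, so substitute whatever the price column is called in your file.

```r
library(dplyr)

# Hypothetical data: both stocks have 12 dated rows in Jan 2015,
# but 4 of permno 1's prices are NA.
toy <- data.frame(
  permno = c(rep(1, 12), rep(2, 12)),
  date = rep(sprintf("01/%02d/2015", 1:12), 2),
  price = c(rep(NA, 4), rep(10, 8), rep(20, 12)),
  stringsAsFactors = FALSE
)

toy_res <- toy %>%
  mutate(MonthYr = format(as.Date(date, format = "%m/%d/%Y"), "%Y-%m")) %>%
  group_by(permno, MonthYr) %>%
  mutate(n_valid = sum(!is.na(price))) %>%  # count only non-missing prices
  group_by(permno) %>%
  filter(all(n_valid >= 10)) %>%            # drop a stock if any month is short
  ungroup() %>%
  filter(!is.na(price))                     # finally drop remaining NA rows

unique(toy_res$permno)
#[1] 2
```

Here permno 1 has 12 rows but only 8 valid prices, so `n()` would keep it while `sum(!is.na(price))` removes it.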

Hope this helps a bit. Thanks again for your help!
