Subsetting based on observations in a month

Question

I'm trying to subset some data and am stuck on the last part of cleaning.

What I need to do is calculate the number of observations for each individual (indivID) in months (June, July, and August) and return a percentage for each without missing data and then keep those observations that are over 75%.

I was able to create a nested for loop, but it took probably 6 hours to process today. I would like to be able to take advantage of parallel computer by using ddply, or another function, but an very lost.

Here's the data (Note this is a very small subset that only includes individuals from 1:5): https://www.dropbox.com/s/fmk8900622klsgt/data.csv?dl=0

And here's the for loop :

epa.d <- read.csv("/.../data.csv")

#Function for loops
days <- function (month){
     if (month == 06) return(as.numeric(30))
     if (month == 07) return(as.numeric(31))
     if (month == 08) return(as.numeric(31))

}    

#Subset data for 75% in June, July, and August
    for (i in unique(epa.d$indivID)){
         for (j in unique(epa.d$year)){
              for (k in unique(epa.d$month)){
                   monthsum <- sum(epa.d$indivID == i & epa.d$year == j & epa.d$month == k   )
                   monthperc = (monthsum/days(k))* 100
                   if (monthperc < 75){
                        epa.d <- epa.d[! (epa.d$indivID == i & epa.d$year == j), ]  

                   }
              }
         }
    }

eipi10 · Accepted Answer · 2014-11-23T00:13:29.727

If I understand you correctly, you want to keep daily observations for each combination of indivID-month-year in which at least 75% of days have ozone measurements. Here's a way to do it that should be pretty fast:

library(dplyr)  

# For each indivID, calculate percent of days in each month with 
# ozone observations, and keep those with pctCoverage >= 0.75
epa.d_75 = epa.d %>% 
  group_by(indivID, year, month) %>%
  summarise(count=n()) %>% 
  mutate(pctCoverage = ifelse(month==6, count/30, count/31)) %>%
  filter(pctCoverage >= 0.75)

We now have a data frame epa.d_75 that has one row for each indivID-month-year with at least 75% coverage. Next, we'll merge the daily data into this data frame, resulting in one row for each daily observation for each unique indivID-month-year.

# Merge in daily data for each combination of indivID-month-year that meets
# the 75% coverage criterion
epa.d_75 = merge(epa.d_75, epa.d, by=c("indivID","month","year"),
                 all.x=TRUE)

Update: To answer the questions in the commments:

Can you explain what the %>% is doing, and if possible a break down of how you logically thought about this.

The %>% is a "chaining" operator that allows you to chain functions one after the other without having to store the result of the previous function before running the next one. Take a look at the dplyr Vignette to learn more about how to use it. Here's how the logic works in this case:

group_by splits the data set by the grouping variables, then runs the next functions separately on each group. In this case, summarise counts the number of rows in the data frame for each unique combination of indivID, month, and year, then mutate adds a column with the fractional coverage for that indivID for that month and year. filter then gets rid of any combination of indivID, month, and year with less than 75% coverage. You can stop the chain at any point to see what it's doing. For example, run the following code to see what epa.d_75 looks like before the filtering operation:

 epa.d_75 = epa.d %>% 
  group_by(indivID, year, month) %>%
  summarise(count=n()) %>% 
  mutate(pctCoverage = ifelse(month==6, count/30, count/31))

why the hell this is so much faster than running for loops? I don't know the answer in detail, but dplyr does most of its magic in C code under the hood, which is faster than native R. Hopefully someone else can give a more precise and detailed answer.

the `epa.d_75` returns 0 rows for me by using the dropbox dataset. Is that what you got? — akrun, Nov 22 '14 at 04:59
With the subset of data the OP provided, yes. When I tested the code, I set the cut-point to 0.25 to ensure that some rows were returned, but his actual data set is much larger and will have lots of months with more than 75% coverage. — eipi10, Nov 22 '14 at 05:01
Wow. I just used this code to process the entire data set and it was done in seconds. However, the # of obs is not the same as the for loop I ran. Can you explain what the %>% is doing, and if possible a break down of how you logically thought about this. Thanks! — Vedda, Nov 22 '14 at 05:07
And if possible why the hell this is so much faster than running for loops? I let that for loop run all day! Thank you so much for your help — Vedda, Nov 22 '14 at 05:09
I suspect the speed of your original code is not only (or even mainly) due to using loops, but might have more to do with the number and types of operations performed each time through the loop. You might try running it on a small data frame--maybe just 20 or 30 rows--and see what happens. — eipi10, Nov 22 '14 at 05:36

score 2 · Answer 2 · answered Nov 22 '14 at 05:07

2

Another option would be to use data.table (similar to @eipi10's dplyr method), which would be very fast.

library(data.table)
epa.d_75 <-  setDT(epa.d)[, list(pctCoverage=ifelse(month==6, .N/30,
               .N/31)),by=list(indivID, year, month)][pctCoverage >=0.75]

epa.d_75New = merge(epa.d_75, epa.d, by=c("indivID","month","year"),
             all.x=TRUE)

data

epa.d <- read.csv('data.csv', row.names=1)

answered Nov 22 '14 at 05:07

akrun

874,273
37
540
662

Thanks for your reply. I ran this on the data set and it worked the same, with the same obs. Thanks! – Vedda Nov 22 '14 at 05:13

Subsetting based on observations in a month

2 Answers2

data

Linked