Conditional filtering of data.frame with preceding and trailing NA observations

Question

I have a data.frame composed of observations and modelled predictions of data. A minimal example dataset could look like this:

myData <- data.frame(tree=c(rep("A", 20)), doy=c(seq(75, 94)), count=c(NA,NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,NA,2,NA,NA,NA,NA,NA), pred=c(0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,6,9,12,20,44))

The count column holds the observations, which were made every 5 days. The pred column holds predictions modelled over the complete set of days, in effect interpolating the data to a daily level.

I would like to conditionally filter this dataset so that I end up truncating the predictions to the same range as the observations, in effect keeping all predictions between when count starts and ends (i.e. removing preceding and trailing rows/values of pred when they correspond to an NA in the count column). For this example, the ideal outcome would be:

   tree doy count pred
5     A  79     0    1
6     A  80    NA    1
7     A  81    NA    1
8     A  82    NA    2
9     A  83    NA    2
10    A  84     1    2
11    A  85    NA    2
12    A  86    NA    3
13    A  87    NA    3
14    A  88    NA    3
15    A  89     2    3

I have tried to solve this by combining filter with first and last. I have also considered using a conditional mutate (probably with lag) to create a 1/0 column flagging whether there is an observation on the previous doy and then filtering on that column, or building a second data.frame that contains the proper doy range and joining it to this data.

In my searches on StackOverflow I have come across the following questions that seemed close, but were not quite what I needed:

Select first observed data and utilize mutate

Conditional filtering based on the level of a factor R

My actual dataset is much larger, with multiple trees over multiple years (each tree/year has a different period of observation depending on the elevation of the site, etc.). I am currently using the dplyr package throughout my code, so an answer within that framework would be great, but I would be happy with any solution.

Updated with a data.table option as you mentioned in the comments — akrun, Jun 24 '15 at 04:30

akrun · Answer 1 · 2015-06-24T07:18:26.303

Try

  indx <- which(!is.na(myData$count))
  myData[seq(indx[1], indx[length(indx)]),]
  #    tree doy count pred
  #5     A  79     0    1
  #6     A  80    NA    1
  #7     A  81    NA    1
  #8     A  82    NA    2
  #9     A  83    NA    2
  #10    A  84     1    2
  #11    A  85    NA    2
  #12    A  86    NA    3
  #13    A  87    NA    3
  #14    A  88    NA    3
  #15    A  89     2    3

If this is based on groups

 ind <- with(myData, ave(!is.na(count), tree,
           FUN=function(x) cumsum(x)>0 & rev(cumsum(rev(x))>0)))
  myData[ind,]
 #   tree doy count pred
 #5     A  79     0    1
 #6     A  80    NA    1
 #7     A  81    NA    1
 #8     A  82    NA    2
 #9     A  83    NA    2
 #10    A  84     1    2
 #11    A  85    NA    2
 #12    A  86    NA    3
 #13    A  87    NA    3
 #14    A  88    NA    3
 #15    A  89     2    3
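
To see what the cumulative-sum mask does when there is more than one group, here is a minimal sketch on a made-up two-tree data set. The `toy` data frame, its values, and the helper name `inside_observed` are hypothetical and not part of the question. A row is kept when at least one observation occurs at or before it and at least one occurs at or after it within the same tree.

```r
# Hypothetical two-tree data, invented for illustration only.
toy <- data.frame(
  tree  = rep(c("A", "B"), each = 6),
  doy   = rep(75:80, times = 2),
  count = c(NA, 0, NA, 1, NA, NA,    # tree A: observed on doy 76 and 78
            NA, NA, NA, 2, NA, 3)    # tree B: observed on doy 78 and 80
)

# Given a logical vector marking observed rows, return TRUE from the
# first observed row through the last observed row (hypothetical helper).
inside_observed <- function(seen) {
  cumsum(seen) > 0 & rev(cumsum(rev(seen)) > 0)
}

# ave() applies the helper separately within each tree.
keep <- ave(!is.na(toy$count), toy$tree, FUN = inside_observed)
trimmed <- toy[keep, ]
trimmed
```

A tree whose count values are all NA gets a mask that is FALSE everywhere, so it simply drops out of the result.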

Or using na.trim from zoo

 library(zoo)
 do.call(rbind,by(myData, myData$tree, FUN=na.trim))

Or using data.table

 library(data.table)
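 # which() gives row positions within each group; .I holds whole-table row
 # numbers, which would misindex .SD once there is more than one group
 setDT(myData)[, .SD[do.call(`:`, as.list(range(which(!is.na(count)))))], by = tree]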
 #   tree doy count pred
 #1:    A  79     0    1
 #2:    A  80    NA    1
 #3:    A  81    NA    1
 #4:    A  82    NA    2
 #5:    A  83    NA    2
 #6:    A  84     1    2
 #7:    A  85    NA    2
 #8:    A  86    NA    3
 #9:    A  87    NA    3
 #10:   A  88    NA    3
 #11:   A  89     2    3

josliber · Accepted Answer · 2015-06-23T21:22:00.307

I think you're just looking to limit the rows to fall between the first and last non-NA count value:

myData[seq(min(which(!is.na(myData$count))), max(which(!is.na(myData$count)))),]
#    tree doy count pred
# 5     A  79     0    1
# 6     A  80    NA    1
# 7     A  81    NA    1
# 8     A  82    NA    2
# 9     A  83    NA    2
# 10    A  84     1    2
# 11    A  85    NA    2
# 12    A  86    NA    3
# 13    A  87    NA    3
# 14    A  88    NA    3
# 15    A  89     2    3

In dplyr syntax, grouping by the tree variable:

library(dplyr)
myData %>%
  group_by(tree) %>%
  filter(seq_along(count) >= min(which(!is.na(count))) &
         seq_along(count) <= max(which(!is.na(count))))
# Source: local data frame [11 x 4]
# Groups: tree
# 
#    tree doy count pred
# 1     A  79     0    1
# 2     A  80    NA    1
# 3     A  81    NA    1
# 4     A  82    NA    2
# 5     A  83    NA    2
# 6     A  84     1    2
# 7     A  85    NA    2
# 8     A  86    NA    3
# 9     A  87    NA    3
# 10    A  88    NA    3
# 11    A  89     2    3
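
Since the real data has several trees over several years, the same filter can presumably be applied per tree/year combination. The following is only a sketch under assumptions: the `year` column, the `toy` data frame, and its values are invented for illustration, and rows are assumed to be orderable by `doy` within each group. The extra `filter(any(...))` step drops groups that have no observations at all, which would otherwise make `min()` and `max()` warn on an empty input.

```r
library(dplyr)

# Hypothetical data: one tree observed in two years (values invented).
toy <- data.frame(
  tree  = "A",
  year  = rep(c(2014, 2015), each = 5),
  doy   = rep(75:79, times = 2),
  count = c(NA, 0, NA, 1, NA,    # 2014: observed on doy 76 and 78
            0, NA, NA, NA, 2)    # 2015: observed on doy 75 and 79
)

trimmed <- toy %>%
  arrange(tree, year, doy) %>%            # make sure rows are in day order
  group_by(tree, year) %>%
  filter(any(!is.na(count))) %>%          # drop groups with no observations
  filter(row_number() >= min(which(!is.na(count))),
         row_number() <= max(which(!is.na(count)))) %>%
  ungroup()

trimmed
```

Here `row_number()` plays the same role as `seq_along(count)` above: both give the position of each row within its group.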
