Period apply after filtering missing data

Question

I have a large xts object of 10-minute data that I am trying to average hourly. Some of the observations are missing. I want to average every hour that has more than 50% data completeness and replace every hour that has less than 50% data completeness with NA. How would I go about doing that?

Representative data:

library(lubridate)
library(xts)
set.seed(1001)
starttime <- ymd_hms('2001-01-01 10:00:00')
endtime <- ymd_hms('2001-01-01 12:50:00')
timevec <- seq(starttime,endtime,by='10 min')
data <- data.frame(replicate(2,sample(0:10,length(timevec),replace=TRUE)))
datana <- as.matrix(data)
datana[sample(seq_along(datana),0.4*length(datana))] <- NA
dat_na <- as.data.frame(datana)|> type.convert(as.is=TRUE)
dat_natime <- cbind(time=timevec,dat_na)
xtsdatna <- xts(dat_natime,order.by = dat_natime$time)

Expected result

                 time   X1   X2
1 2001-01-01 10:00:00 4.50 6.20
2 2001-01-01 11:00:00   NA 8.00
3 2001-01-01 12:00:00 6.25 4.67

score 1 · Accepted Answer · answered Jun 23 '23 at 00:02

You can write a function that checks whether the proportion of NA values is above a threshold, returning NA if so and the mean of the non-missing values otherwise.

mean_or_na <- function(x, na_threshold){
  if (mean(is.na(x)) > na_threshold) {
    NA_real_
  } else {
    mean(x, na.rm = TRUE)
  }
}
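As a quick check of how the function behaves, here is a small sketch on made-up vectors (the vector names and values are illustrative, not taken from the question's data). Note that with a threshold of 0.5, a group that is exactly half missing still gets a mean, because the comparison is strictly greater than.

```r
# Same function as above, repeated so this snippet runs on its own
mean_or_na <- function(x, na_threshold) {
  if (mean(is.na(x)) > na_threshold) {
    NA_real_
  } else {
    mean(x, na.rm = TRUE)
  }
}

# Hypothetical one-hour groups of six 10-minute readings
mostly_missing <- c(4, NA, NA, NA, 5, NA)  # 4 of 6 missing
mostly_present <- c(4, NA, 6, 2, NA, 8)    # 2 of 6 missing
half_missing   <- c(1, NA, 3, NA, 5, NA)   # exactly 3 of 6 missing

r1 <- mean_or_na(mostly_missing, 0.5)  # NA: 4/6 is above the threshold
r2 <- mean_or_na(mostly_present, 0.5)  # 5: mean of 4, 6, 2, 8
r3 <- mean_or_na(half_missing, 0.5)    # 3: 0.5 is not greater than 0.5
```

If hours that are exactly 50% complete should also become NA, change `>` to `>=` in the function.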

I'm not familiar with xts, but using the data.frame format you could apply the function with:

suppressPackageStartupMessages({
library(dplyr)
library(lubridate)})

dat_natime %>% 
  group_by(time_hour = floor_date(time, 'hour')) %>% 
  summarise(across(starts_with('X'), \(x) mean_or_na(x, 0.5)))
#> # A tibble: 3 × 3
#>   time_hour              X1    X2
#>   <dttm>              <dbl> <dbl>
#> 1 2001-01-01 10:00:00   7   NA   
#> 2 2001-01-01 11:00:00   2.8  3   
#> 3 2001-01-01 12:00:00   2.5  5.33

G. Grothendieck · Answer 2 · 2023-06-25T16:05:56.790

The problem is that the xtsdatna object shown in the question is malformed. In an xts object the time index is stored as an attribute of the object, not as part of the data. Because the time column was passed in as data, the whole object wound up as character, which is likely why nothing you tried worked.
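The coercion can be seen in base R alone. This is a minimal sketch that assumes the data frame is turned into a matrix when the xts object is built; the two-row data frame here is made up for illustration.

```r
# Two hypothetical 10-minute timestamps and one numeric column
time <- as.POSIXct("2001-01-01 10:00:00", tz = "UTC") + c(0, 600)
df <- data.frame(time = time, X1 = c(4, NA))

# A matrix holds a single type, so a date-time column forces everything to character
m_with_time <- as.matrix(df)

# Without the time column the matrix stays numeric
m_without_time <- as.matrix(df["X1"])

typeof(m_with_time)     # "character"
typeof(m_without_time)  # "double"
```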

First we correct the line that builds xtsdatna, and then use aggregate.zoo to generate the summary. aggregate.zoo applies its second argument to the time index and uses the result both as the grouping vector and as the new index. It then applies its third argument to each column within each group.

This gives a summarized xts object. Omit the as.xts() part if a zoo object is wanted instead.

xtsdatna <- xts(dat_na, timevec)

Mean <- function(x) if (mean(is.na(x)) > 0.5) NA else mean(x, na.rm = TRUE)
floor_hour <- function(x) floor_date(x, "hour")
aggregate(xtsdatna, floor_hour, Mean) |> as.xts()
##                      X1       X2
## 2001-01-01 10:00:00 7.0       NA
## 2001-01-01 11:00:00 2.8 3.000000
## 2001-01-01 12:00:00 2.5 5.333333
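To make the grouping step concrete, here is a base-R sketch of the same idea: derive an hour label from each timestamp, then apply the function to each column within each hour. It illustrates the mechanism only and is not how aggregate.zoo is implemented; the `vals` matrix and its values are made up.

```r
Mean <- function(x) if (mean(is.na(x)) > 0.5) NA_real_ else mean(x, na.rm = TRUE)

# Two hours of hypothetical 10-minute timestamps (10:00 to 11:50)
time <- as.POSIXct("2001-01-01 10:00:00", tz = "UTC") + 600 * (0:11)
vals <- cbind(
  X1 = c(1, NA, 3, 5, NA, 7,    2, 4, 6, 8, NA, 10),
  X2 = c(NA, NA, NA, NA, 5, 7,  1, 1, 1, NA, NA, NA)
)

# Step 1: map every timestamp to the start of its hour (the grouping vector)
hour <- format(time, "%Y-%m-%d %H:00:00")

# Step 2: for each column, apply Mean within each hour group
hourly <- apply(vals, 2, function(col) tapply(col, hour, Mean))
hourly
```

In this made-up data, X2 is NA for the 10:00 hour because four of its six readings are missing, while the other three cells get a mean.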
