Time Series data aggregation and NA handling using R

Question

My time series data has the following format:

                        Ask    Bid  Trade Ask_Size Bid_Size Trade_Size
2016-11-01 01:00:03     NA 938.10     NA       NA      203         NA
2016-11-01 01:00:04     NA 937.20     NA       NA      100         NA
2016-11-01 01:00:04 938.00     NA     NA       28       NA         NA
2016-11-01 01:00:04     NA 938.10     NA       NA      203         NA
2016-11-01 01:00:04 939.00     NA     NA       11       NA         NA
2016-11-01 01:00:05     NA 938.15     NA       NA       19         NA
2016-11-01 01:00:06     NA 937.20     NA       NA      100         NA
2016-11-01 01:00:06 938.00     NA     NA       28       NA         NA
2016-11-01 01:00:06     NA     NA 938.10       NA       NA         69
2016-11-01 01:00:06     NA     NA 938.10       NA       NA        831
2016-11-01 01:00:06     NA 938.10     NA       NA      134         NA

The structure of the time series object is:

str(df_ts)

An ‘xts’ object on 2016-11-01 01:00:03/2016-11-02 12:59:37 containing:
  Data: num [1:35797, 1:6] NA NA 938 NA 939 NA NA 938 NA NA ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:6] "Ask" "Bid" "Trade" "Ask_Size" ...
  Indexed by objects of class: [POSIXct,POSIXt] TZ: 
  xts Attributes:  
 NULL

I am trying to aggregate the data into 1-minute intervals using the following code:

# Creating a Function
apply.periodly <- function (x, FUN, period, k = 1, ...) 
{
  if (!require("xts")) {
    stop("Need 'xts'")
  }
  ep <- endpoints(x, on = period, k=k)
  period.apply(x, ep, FUN, ...)
}

# Aggregation every minute

df_aggregate_min <- apply.periodly(x = df_ts, FUN = mean, period = "minutes", k = 1)

Because of the NA values in the data, the output is wrong. How do I aggregate each column every minute while ignoring the NAs?

custom mean function (`naMean <- function(x){mean(x, na.rm = TRUE)}`) in your last line there should do the trick — TBSRounder, Nov 25 '16 at 16:18
Thanks, but the resulting output is the mean of the entire column, I would like to get the column wise sum every minute. The code that I used is df_aggregate_min <- apply.periodly(x = df_ts, FUN = naMean, period = "minutes", k = 1) — Abhishek, Nov 25 '16 at 16:27
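
A possible reason for the behaviour described in the comment above: `mean()` applied to a multi-column chunk pools every cell into one number, whereas `colMeans()` and `colSums()` return one value per column. The sketch below uses a small made-up matrix (the values are illustrative, not taken from the real data) to show the difference.

```r
# Made-up chunk standing in for one minute of rows (illustrative values only)
chunk <- cbind(Ask = c(NA, 938, 939),
               Bid = c(938.1, NA, 937.2))

# mean() flattens the matrix: one pooled number over all non-NA cells
pooled <- mean(chunk, na.rm = TRUE)

# colMeans()/colSums() keep the columns separate and skip NAs per column
per_column_mean <- colMeans(chunk, na.rm = TRUE)
per_column_sum  <- colSums(chunk, na.rm = TRUE)

print(pooled)
print(per_column_mean)
print(per_column_sum)
```

Passing a column-wise function as `FUN` is therefore one way to get a per-column result for each period.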

knb · Accepted Answer · 2016-11-26T11:04:38.797

This handles two columns (Ask and Trade_Size), one at a time:

library(readr)
library(xts)
library(lubridate)
Sys.setenv(TZ = "UTC")
# hack: in-place edit of infile Sample_HFT.csv 
# replace first comma with "T" to create ISO-datetime strings
# do this only ONCE!
system('perl -pi -E "s/,/T/" Sample_HFT.csv')

hft <- read_csv("Sample_HFT.csv", col_names = TRUE)
head(hft)

hft.xts <- as.xts(hft[, -1], order.by = ymd_hms(hft$T))
indexFormat(hft.xts) <- "%y-%m-%d %H:%M:%S"

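# Cumulative mean of the non-NA observations in one chunk;
# only the value at the end of each minute is returned.
my.cummean <- function(x) {
    obs <- x[!is.na(x)]                          # drop NA rows
    cummeans <- cumsum(obs) / seq_along(obs)     # running mean
    cummeans[endpoints(cummeans, "minutes"),]    # keep the last value per minute
}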

ask_minutes <- split(hft.xts$Ask, f = "minutes")
ask_minutes_cum <- lapply(ask_minutes, my.cummean)
ask_minutes_mean <- do.call("rbind", ask_minutes_cum)

trade_size_minutes <- split(hft.xts$Trade_Size, f = "minutes")
trade_size_minutes_cum <- lapply(trade_size_minutes, my.cummean)
trade_size_minutes_mean <- do.call("rbind", trade_size_minutes_cum)

I still don't know if this is the desired business logic, but I think you can figure the details out.
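
As a quick check on the logic: the last value of a running mean over the non-NA observations is the same as the plain NA-ignoring mean of the chunk. The sketch below uses a made-up vector (not values from the sample file) to illustrate this.

```r
# Made-up trade sizes for one minute (illustrative values only)
x <- c(NA, 69, 831, NA, 12)

obs <- x[!is.na(x)]                       # keep the observed values
running <- cumsum(obs) / seq_along(obs)   # running mean after each observation

last_running <- tail(running, 1)          # value at the end of the chunk
plain_mean   <- mean(x, na.rm = TRUE)     # direct NA-ignoring mean

print(last_running)
print(plain_mean)
```

So, if only the end-of-minute value is needed, `mean(x, na.rm = TRUE)` per minute should give the same figure with less code.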

head(trade_size_minutes_mean) 
                   Trade_Size
16-11-01 01:00:35    194.500
16-11-01 01:01:59     59.909
16-11-01 01:02:48      5.875
16-11-01 01:03:34      6.000
16-11-01 01:08:57      3.889
16-11-01 01:09:29      1.682
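
If the goal is a column-wise aggregate for all columns at once, one alternative is to pass a column-wise function to `period.apply()`. The following is a minimal sketch, assuming that prices should be averaged and sizes summed; that choice, the toy data, and the variable names (`toy`, `price_cols`, `size_cols`, `agg_1min`) are assumptions for illustration and not part of the original data or answer.

```r
library(xts)

# Toy stand-in for the quote data (made-up values, two minutes of rows)
idx <- as.POSIXct("2016-11-01 01:00:03", tz = "UTC") + c(0, 1, 1, 58, 60, 62)
toy <- xts(
  cbind(Ask        = c(NA, 938.0, NA, 939.0, NA, 940.0),
        Bid        = c(938.1, NA, 937.2, NA, 938.15, NA),
        Trade_Size = c(NA, NA, 69, 831, NA, NA)),
  order.by = idx
)

# Row positions that close each 1-minute period
ep <- endpoints(toy, on = "minutes", k = 1)

price_cols <- c("Ask", "Bid")   # assumed: average these
size_cols  <- "Trade_Size"      # assumed: sum these

# Column-wise, NA-ignoring aggregation per minute
prices_1min <- period.apply(toy[, price_cols], ep,
                            function(chunk) colMeans(chunk, na.rm = TRUE))
sizes_1min  <- period.apply(toy[, size_cols], ep,
                            function(chunk) colSums(chunk, na.rm = TRUE))

# Both results share the same end-of-minute index, so merge aligns them
agg_1min <- merge(prices_1min, sizes_1min)
print(agg_1min)
```

Note that a column with no observations in a given minute yields NaN under `colMeans(..., na.rm = TRUE)` and 0 under `colSums(..., na.rm = TRUE)`, which may or may not be what is wanted.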

Could you please help. The above method didn't work. Sharing the link to the data sample https://www.dropbox.com/s/m94y6pbhjlkny1l/Sample_HFT.csv?dl=0 — Abhishek, Nov 26 '16 at 00:35
I've updated my answer. Can you put the link to your sample file in the question body? It is then more visible to readers potentially motivated to provide solutions in R code. — knb, Nov 26 '16 at 10:39
