2

I have a high-frequency dataset of foreign exchange rates, timestamped down to the millisecond, which I would like to transform into lower-frequency, regular time series in R, e.g. a 1-minute or 5-minute OHLC series (open, high, low, close). The original dataset has four columns: one for the currency pair, one for the timestamp (which includes both the date and the time), and two for the bid and ask prices. The data have been imported from a .csv file.
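For reference, a minimal sketch of how such a file might be imported (the file name `GBPUSD.csv` is an assumption, and the compact column specification assumes the timestamps are in an ISO-8601-like format that `readr` can parse, milliseconds included):

library(readr)

# Hypothetical import sketch: file name and column spec are assumptions, not
# taken from the question. With col_names = FALSE, readr names the columns
# X1..X4, matching the output shown below.
GBPUSD <- read_csv("GBPUSD.csv",
                   col_names = FALSE,
                   col_types = "cTdd")  # pair, timestamp, bid, ask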

`head(GBPUSD)` and `tail(GBPUSD)` return the following (head first, then tail):

# A tibble: 6 x 4
       X1                  X2      X3      X4
    <chr>              <dttm>   <dbl>   <dbl>  
1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763  
2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760  
3 GBP/USD 2017-06-01 00:00:00 1.28754 1.28759  
4 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759  
5 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759  
6 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759


# A tibble: 6 x 4
       X1                  X2      X3      X4
    <chr>              <dttm>   <dbl>   <dbl>
1 GBP/USD 2017-06-30 20:59:56 1.30093 1.30300  
2 GBP/USD 2017-06-30 20:59:56 1.30121 1.30300  
3 GBP/USD 2017-06-30 20:59:56 1.30100 1.30390  
4 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452  
5 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447  
6 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447  
Arani
hx1
  • It would be useful if you include `head(yourdata)`, `tail(yourdata)`. Also, imgur.com does not work. You can use any other storage. – Erdogan CEVHER Dec 18 '17 at 11:20
  • Thanks, please find the head (did not have enough space for the tail). These data have been imported directly from a .csv file: # A tibble: 6 x 4 X1 X2 X3 X4 1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763 2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760 3 GBP/USD 2017-06-01 00:00:00 1.28754 1.28759 4 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 5 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 6 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 – hx1 Dec 18 '17 at 11:40
  • Edit your question accordingly; not in comments part, but in original question part. Also, use "{}" code symbol to clearly present your data. – Erdogan CEVHER Dec 18 '17 at 11:43
  • Possible duplicate of [Rounding time to nearest quarter hour](https://stackoverflow.com/questions/10862056/rounding-time-to-nearest-quarter-hour) – Len Greski Dec 18 '17 at 11:51
  • @LenGreski: he is more concerned in the aggregating part (High, Low, Open, Close), I suppose. – Arani Dec 19 '17 at 08:57
  • @Arani - agreed, but the OP doesn't describe how the data should be aggregated, or ask for guidance on how to aggregate it. – Len Greski Dec 19 '17 at 11:26
  • Yeah the header is not wholly representative of the question in the body. It should be edited. EDIT: Done. – Arani Dec 19 '17 at 12:02

4 Answers

3

It seems you want to turn each column (bid, ask) into 4 columns (Open, High, Low, Close), grouped by some time interval like 5 minutes. I appreciate @dmi3kno showing off a few tibbletime features, but I think that this might do more of what you want.

Note that this will change a bit in the next release of tibbletime, but currently under 0.0.2 this works.

For each 5 minute period, the Open/High/Low/Close prices of both the bid and ask columns are taken.


library(tibbletime)
library(dplyr)

df <- create_series("2017-12-20 00:00:00" ~ "2017-12-20 01:00:00", "sec") %>% 
  mutate(bid = runif(nrow(.)),
         ask = bid + .0001)
df
#> # A time tibble: 3,601 x 3
#> # Index: date
#>    date                   bid    ask
#>  * <dttm>               <dbl>  <dbl>
#>  1 2017-12-20 00:00:00 0.208  0.208 
#>  2 2017-12-20 00:00:01 0.0629 0.0630
#>  3 2017-12-20 00:00:02 0.505  0.505 
#>  4 2017-12-20 00:00:03 0.0841 0.0842
#>  5 2017-12-20 00:00:04 0.986  0.987 
#>  6 2017-12-20 00:00:05 0.225  0.225 
#>  7 2017-12-20 00:00:06 0.536  0.536 
#>  8 2017-12-20 00:00:07 0.767  0.767 
#>  9 2017-12-20 00:00:08 0.994  0.994 
#> 10 2017-12-20 00:00:09 0.807  0.808 
#> # ... with 3,591 more rows

df %>%
  mutate(date = collapse_index(date, "5 min")) %>%
  group_by(date) %>%
  summarise_all(
    .funs = funs(
      open  = dplyr::first(.),
      high  = max(.),
      low   = min(.),
      close = dplyr::last(.)
    )
  )
#> # A time tibble: 13 x 9
#> # Index: date
#>    date                bid_o… ask_o… bid_h… ask_h…  bid_low ask_low bid_c…
#>  * <dttm>               <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>
#>  1 2017-12-20 00:04:59  0.208  0.208  1.000  1.000 0.00293  3.03e⁻³ 0.389 
#>  2 2017-12-20 00:09:59  0.772  0.772  0.997  0.997 0.000115 2.15e⁻⁴ 0.676 
#>  3 2017-12-20 00:14:59  0.457  0.457  0.995  0.996 0.00522  5.32e⁻³ 0.363 
#>  4 2017-12-20 00:19:59  0.586  0.586  0.997  0.997 0.00912  9.22e⁻³ 0.0339
#>  5 2017-12-20 00:24:59  0.385  0.385  0.998  0.998 0.0131   1.32e⁻² 0.0907
#>  6 2017-12-20 00:29:59  0.548  0.548  0.996  0.996 0.00126  1.36e⁻³ 0.320 
#>  7 2017-12-20 00:34:59  0.240  0.240  0.995  0.995 0.00466  4.76e⁻³ 0.153 
#>  8 2017-12-20 00:39:59  0.404  0.405  0.999  0.999 0.000481 5.81e⁻⁴ 0.709 
#>  9 2017-12-20 00:44:59  0.468  0.468  0.999  0.999 0.00101  1.11e⁻³ 0.0716
#> 10 2017-12-20 00:49:59  0.580  0.580  0.996  0.996 0.000336 4.36e⁻⁴ 0.395 
#> 11 2017-12-20 00:54:59  0.242  0.242  0.999  0.999 0.00111  1.21e⁻³ 0.762 
#> 12 2017-12-20 00:59:59  0.474  0.474  0.987  0.987 0.000858 9.58e⁻⁴ 0.335 
#> 13 2017-12-20 01:00:00  0.974  0.974  0.974  0.974 0.974    9.74e⁻¹ 0.974 
#> # ... with 1 more variable: ask_close <dbl>

Update: The post has been updated to reflect the changes in tibbletime 0.1.0.

Davis Vaughan
  • Thank you, Davis. I did not see predicate functions in tibbletime, so assumed it would drop class. Agree that this goes extra mile towards desired result. – dmi3kno Dec 20 '17 at 20:20
  • @dmi3kno, predicate functions like `summarise_all()` are built using `summarise()` under the hood, so no classes are dropped! – Davis Vaughan Dec 20 '17 at 20:22
0

I changed the OP's original dataset a little for pedagogical/instructional reasons:

df <- data.frame(
X1=c("GBP/USD"), 
X2=c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:02", "2017-06-30 20:59:52", "2017-06-30 20:59:54", "2017-06-30 20:59:54", "2017-06-30 20:59:56", "2017-06-30 20:59:56", "2017-06-30 20:59:56"), 
X3=c(1.28756, 1.28754, 1.28754, 1.28753, 1.28752, 1.28757, 1.30093, 1.30121, 1.30100, 1.30146, 1.30145,1.30145), 
X4=c(1.28763, 1.28760, 1.28759, 1.28758, 1.28755, 1.28760,1.30300, 1.30300, 1.30390, 1.30452, 1.30447, 1.30447), 
stringsAsFactors=FALSE)

df

        X1                  X2      X3      X4
1  GBP/USD 2017-06-01 00:00:00 1.28756 1.28763
2  GBP/USD 2017-06-01 00:00:00 1.28754 1.28760
3  GBP/USD 2017-06-01 00:00:01 1.28754 1.28759
4  GBP/USD 2017-06-01 00:00:01 1.28753 1.28758
5  GBP/USD 2017-06-01 00:00:01 1.28752 1.28755
6  GBP/USD 2017-06-01 00:00:02 1.28757 1.28760
7  GBP/USD 2017-06-30 20:59:52 1.30093 1.30300
8  GBP/USD 2017-06-30 20:59:54 1.30121 1.30300
9  GBP/USD 2017-06-30 20:59:54 1.30100 1.30390
10 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452
11 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447
12 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447

Now, in the lower-frequency data, consecutive observations that share the same timestamp form a group. So we must find the indices corresponding to the unique starting points of the groups, and from them the endings of the groups:

indices <- seq_along(df[,2])[!(duplicated(df[,2]))] # 1  3  6  7  8 10; the beginnings of groups (observations)
indices - 1   # 0  2  5  6  7   9; for finding the endings of groups
numberoflowfreq <- length(indices) # 6: number of groupings (obs.) for Low Freq data

To understand the pattern, write it out explicitly:

mean(df[1:((indices -1)[2]),3]) # from 1 to 2
mean(df[indices[2]:((indices -1)[3]),3]) # from 3 to 5
mean(df[indices[3]:((indices -1)[4]),3]) # from 6 to 6
mean(df[indices[4]:((indices -1)[5]),3]) # from 7 to 7
mean(df[indices[5]:((indices -1)[6]),3]) # from 8 to 9
mean(df[indices[6]:nrow(df),3]) # from 10 to 12

Simplify the pattern:

mean3rdColumn_1st <- mean(df[1:((indices -1)[2]),3]) # from 1 to 2
mean3rdColumn_Between <- sapply(2:(numberoflowfreq-1), function(i)  mean(df[indices[i]:((indices -1)[i+1]),3]) )
mean3rdColumn_Last <- mean(df[indices[6]:nrow(df),3]) # from 10 to 12
# 3rd column in low frequency data:    
c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last)

Similarly for the 4th column:

mean4thColumn_1st <- mean(df[1:((indices -1)[2]),4]) # from 1 to 2
mean4thColumn_Between <- sapply(2:(numberoflowfreq-1), function(i)  mean(df[indices[i]:((indices -1)[i+1]),4]) )
mean4thColumn_Last <- mean(df[indices[6]:nrow(df),4]) # from 10 to 12
# 4th column in low frequency data: 
c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last)

Collect all effort:

LowFrqData <- data.frame(X1=c("GBP/USD"),
                         X2=df[indices,2],
                         X3=c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last),
                         x4=c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last),
                         stringsAsFactors=FALSE)
LowFrqData 

       X1                  X2       X3       x4
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487

Now, the column X2 has unique (second-level) timestamp values, and X3 and x4 were formed as the means of the relevant cells.

Also note that there may not be values for every timestamp in a range. One may insert NAs for such cases. On the other hand, one may neglect the effect of this irregularity, since the spacing of the observations would be roughly the same for many observations and therefore not highly irregular. Also consider that transforming the data into equally spaced observations using linear interpolation can introduce a number of significant and hard-to-quantify biases (see Scholes and Williams).

M. Scholes and J. Williams, “Estimating betas from nonsynchronous data”, Journal of Financial Economics 5: 309–327, 1977.
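A minimal sketch of the NA-padding idea mentioned above (an illustration under assumptions, not part of the original steps): build a complete one-second grid over the observed range and left-join the aggregated data onto it, so that missing timestamps show up as NA rows.

# Sketch only: pad gaps with NAs by joining against a complete 1-second grid.
# (For a full month of tick data this grid gets large; shown here for the toy data.)
full_grid <- data.frame(
  X2 = format(seq(as.POSIXct(min(LowFrqData$X2)),
                  as.POSIXct(max(LowFrqData$X2)),
                  by = "1 sec")),
  stringsAsFactors = FALSE)
LowFrqPadded <- merge(full_grid, LowFrqData, by = "X2", all.x = TRUE)
head(LowFrqPadded)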

Now, on to the regular 5-minute series part:

as.numeric(as.POSIXct("1970-01-01 03:00:00"))  # 0; starting point for ZERO seconds. "1970-01-01 03:01:00" equals 60.
as.numeric(as.POSIXct("2017-06-01 00:00:00")) # 1496264400
# Passed seconds after the first observation in the dataset
PassedSecs <- as.numeric(as.POSIXct(LowFrqData$X2)) - 1496264400

LowFrq5minuteRaw <- cbind(LowFrqData, PassedSecs, stringsAsFactors=FALSE)
LowFrq5minuteRaw

       X1                  X2       X3       x4 PassedSecs
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615          0
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573          1
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600          2
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000    2581192
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450    2581194
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487    2581196

5 minutes means 5*60 = 300 seconds. So grouping the observations by the integer quotient of PassedSecs divided by 300 places them into 5-minute intervals.

LowFrq5minuteRaw2 <- cbind(LowFrqData, PassedSecs, QbyDto300 = PassedSecs%/%300, stringsAsFactors=FALSE)
LowFrq5minuteRaw2

       X1                  X2       X3       x4 PassedSecs QbyDto300
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615          0         0
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573          1         0
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600          2         0
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000    2581192      8603
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450    2581194      8603
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487    2581196      8603

indices2 <- seq_along(LowFrq5minuteRaw2[,6])[!(duplicated(LowFrq5minuteRaw2[,6]))] # 1  4; the beginnings of groups

LowFrq5minute <- data.frame(X1=c("GBP/USD"),
                            X2=LowFrq5minuteRaw2[indices2,2],
                            X3=aggregate(X3 ~ QbyDto300, LowFrq5minuteRaw2, mean)[,2],
                            X4=aggregate(x4 ~ QbyDto300, LowFrq5minuteRaw2, mean)[,2],
                            stringsAsFactors=FALSE)
LowFrq5minute

       X1                  X2       X3       X4
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287596
2 GBP/USD 2017-06-30 20:59:52 1.301163 1.303646

X2 holds the timestamp of the first observation falling in each 5-minute interval.
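As a cross-check (an alternative sketch, not the method used above), the same 5-minute bucketing can be done with base R's `cut()` on POSIXct timestamps; note that `cut()` starts its breaks from the first timestamp truncated to the minute, so the bucket boundaries may be aligned slightly differently than the quotient-based grouping:

# Alternative sketch: cut() assigns each timestamp to a 5-minute interval,
# and aggregate() then averages the bid/ask means within each interval.
LowFrqData$bucket <- cut(as.POSIXct(LowFrqData$X2), breaks = "5 min")
aggregate(cbind(X3, x4) ~ bucket, data = LowFrqData, FUN = mean)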

Erdogan CEVHER
0

I think all of this would be easier with the `aggregate` function. Though, based on the data, you might need to convert the datetime column to character first (in case the original data holds millisecond values). I recommend using `lubridate` to convert it back to datetime if you need to.

GBPUSD$X2 <- as.character(GBPUSD$X2) #optional; if the below yields bad results
GBPUSD$X2 <- substr(GBPUSD$X2, 1, 19) #optional; keeps the timestamp only up to seconds after the above command
# get High values for both bid and ask prices:
GBPUSD_H <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=max)
# get Low values for both bid and ask prices:
GBPUSD_L <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=min)
# merging the High and Low values together (GBPUSD_H and GBPUSD_L are plain data.frames, so base merge() works)
GBPUSD_NEW <- merge(GBPUSD_H, GBPUSD_L, by=c("X1", "X2"), suffixes=c(".HIGH", ".LOW"))

To get all High, Low, Open, & Close values in one shot:

library(data.table)
GBPUSD <- data.table(GBPUSD, key=c("X1", "X2"))
GBPUSD_NEW <- GBPUSD[, list(X3.HIGH=max(X3), X3.LOW=min(X3), X3.OPEN=X3[1],
                            X3.CLOSE=X3[length(X3)], X4.HIGH=max(X4), X4.LOW=min(X4),
                            X4.OPEN=X4[1], X4.CLOSE=X4[length(X4)]), by=c("X1", "X2")]

However, for this to work, you first need to sort your data so that the first value is the open and last value is the close value for each second.
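A hedged sketch of that sorting step (it assumes `X2` still carries the full millisecond timestamp at the moment you sort, i.e. before it is truncated to seconds):

library(data.table)

# Sketch: order rows chronologically so that X3[1]/X4[1] are the true opens and
# the last values within each group are the true closes.
setDT(GBPUSD)
setorder(GBPUSD, X1, X2)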

Now, if you need to use minutes instead of seconds (or hours), just adjust the `substr` accordingly. If you want more customization, like a 15-minute interval, I would suggest adding a helper column. Sample code:

GBPUSD$MIN <- floor(as.numeric(substr(GBPUSD$X2, 15, 16))/15) # 15-minute bucket index: 0, 1, 2 or 3
GBPUSD$X2 <- paste0(substr(GBPUSD$X2, 1, 14), sprintf("%02d", GBPUSD$MIN*15), ":00") # e.g. "... 00:00:00" for 00:00-00:14

Please do not hesitate to ask if your requirement is not fulfilled.

P.S.: NAs create problems in `aggregate` if the key columns contain them, so deal with them first.

GBPUSD$X2[is.na(GBPUSD$X2)] <- "2017-05-05 00:00:00" #example; be careful to use the same class and format for the replacement
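Alternatively (a sketch, not from the original answer), you can simply drop the rows whose key columns are NA before aggregating:

# Sketch: drop rows with missing keys instead of imputing a placeholder timestamp.
GBPUSD <- GBPUSD[!is.na(GBPUSD$X1) & !is.na(GBPUSD$X2), ]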
Arani
0

This is a perfect example of a case for the awesome tibbletime package. I am going to generate my own data to make the point.

library(tibbletime)
library(dplyr)
df <- tibbletime::create_series(2017-12-20 + 01:06:00 ~ 2017-12-20 + 01:20:00, "sec") %>% 
         mutate(open=runif(nrow(.)),
                close=runif(nrow(.)))
df

This is now seconds-resolution data covering 15 minutes:

# A time tibble: 841 x 3
# Index: date
                  date       open       close
 *              <dttm>      <dbl>       <dbl>
 1 2017-12-20 01:06:00 0.63328803 0.357378011
 2 2017-12-20 01:06:01 0.09597444 0.150583962
 3 2017-12-20 01:06:02 0.23601820 0.974341599
 4 2017-12-20 01:06:03 0.71832656 0.092265867
 5 2017-12-20 01:06:04 0.32471587 0.391190310
 6 2017-12-20 01:06:05 0.76378711 0.534765217
 7 2017-12-20 01:06:06 0.92463265 0.694693458
 8 2017-12-20 01:06:07 0.74026638 0.006054806
 9 2017-12-20 01:06:08 0.77064030 0.911641146
10 2017-12-20 01:06:09 0.87130949 0.740816479
# ... with 831 more rows

Changing periodicity of the data is as easy as one command:

as_period(df, 5~M)

This will downsample the data to 5-minute intervals (tibbletime keeps the first observation of every period by default; it does not average or sum).

# A time tibble: 3 x 3
# Index: date
                 date      open     close
*              <dttm>     <dbl>     <dbl>
1 2017-12-20 01:06:00 0.6332880 0.3573780
2 2017-12-20 01:11:00 0.9235639 0.7043025
3 2017-12-20 01:16:00 0.6955685 0.1641798
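If you want OHLC aggregation rather than first-observation sampling on this simulated data, here is a version-independent sketch with plain `dplyr` and `lubridate` (not part of this answer; it mirrors the approach in the accepted answer, and the simulated `open` column simply stands in for a price series):

library(dplyr)
library(lubridate)

# Sketch: floor each timestamp to its 5-minute bucket and compute OHLC per bucket.
df %>%
  mutate(bucket = floor_date(date, "5 minutes")) %>%
  group_by(bucket) %>%
  summarise(px_open  = first(open),
            px_high  = max(open),
            px_low   = min(open),
            px_close = last(open))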

Check out this awesome vignette for more details

dmi3kno
  • If "`tibbletime` picks first observation for every period by default, not average or sum" as you stated, then, does that mean that `tibbletime` loses information in observations other than the first observations? To me, all the info in the dataset should be used. – Erdogan CEVHER Dec 20 '17 at 09:35
  • Check out the package documentation. There are `time_summarize` and `time_collapse`. In time series, aggregation does not always make sense. Imagine you are just making measurements less frequently: averages will never match real life and can be skewed by outliers. – dmi3kno Dec 20 '17 at 09:53