How to merge a daily xts into a sparse time-indexed xts?

Question

I have a multi-column xts object, which goes to second accuracy. I then have another xts object which contains one value for each day. I'd like to add that daily value as a column in the main xts object. Here is an example:

Sys.setenv(TZ = "UTC") 
library(xts)

set.seed(777)

xt = xts( data.frame(A=1:20,B=201:220,C=round(runif(20)*10,1)),
  order.by = as.POSIXct("2015-06-21") + (runif(20) * 86400 * 14) )

xd = xts( round(runif(14) - 0.5,1), as.Date("2015-06-21") +  (1:14))

Using merge does not work: the xd entries get given a "00:00:00" timestamp, and therefore none of them match, so I get an xts object with lots of NAs:

                     A   B    C   xd
2015-06-21 10:04:36  5 205  7.0   NA
2015-06-22 00:00:00 NA  NA   NA -0.5
2015-06-23 00:00:00 NA  NA   NA -0.2
2015-06-23 11:42:38  4 204 10.0   NA
2015-06-24 00:00:00 NA  NA   NA  0.1
...

Expected result:

                     A   B    C   xd
2015-06-21 10:04:36  5 205  7.0   NA
2015-06-23 11:42:38  4 204 10.0 -0.2
2015-06-24 21:16:18 18 218  8.7  0.1
2015-06-25 02:30:24 15 215  8.7 -0.2
2015-06-25 07:48:42 16 216  1.0 -0.2
2015-06-25 15:04:34 14 214  5.9 -0.2
2015-06-26 07:50:09  1 201  6.9 -0.3
2015-06-27 19:28:33  7 207  3.5  0.5
...

Points:

Real data will be much larger than this example, so excessive use of memory and CPU is to be avoided.
As shown above, NAs or missing dates are possible in xd (though relatively rare).
There are some days not represented in xt (as shown by the missing 2015-06-22 above). I don't want an entry created for such days. (I suppose I could use na.omit to remove them, but it is possible - and not shown in my simple example - that I have some real NAs in the data that I don't want removed.)

UPDATE: As an example of NAs in the original data getting lost, consider xt[10,'B'] <- NA. Using Joshua's merge(xt, xd, fill=na.locf)[index(xt)] solution, 2015-06-28 19:41:45 ends up as 8 203 1.7 0.4 when it should be 8 NA 1.7 0.4. Whether that is a problem or not will depend on what xt will be used for next. FXQuantTrader's answer shows a workaround to preserve NAs using a magic number, which is turned back into an NA at the end. One alternative (that uses more memory) is to take a copy of any columns containing NAs, and replace the whole column afterwards.

BTW, my "avoid excessive memory" point was in relation to an idea I (briefly) had to use `na.locf` and `seq` to create one entry for every second in `xd`, so that the merge would work (e.g. http://stackoverflow.com/a/8981517/841830 ) That would require 86,400 as many rows (and gets even sillier if `xt` uses milliseconds)! — Darren Cook, Aug 06 '15 at 21:37

Joshua Ulrich · Answer 1 · 2015-08-07T16:27:20.557

3

The "canonical" way I would do something like this is:

Merge the two objects.
Call na.locf on the result.
Subset that result so it only contains the desired index values.

You can do the na.locf call via the merge.xts function's fill argument. For example:

xtd <- merge(xt, xd, fill=na.locf)[index(xt)]

That's the easiest solution I can think of, which may not necessarily be the most performant. Let me know if it's suitable for your use case. If not, I'll have to spend some time thinking about a more efficient solution.

We can use the same paradigm if xt has NA that need to be preserved, but we need to run na.locf on the columns in xd only.

xtd <- merge(xt, xd)
xtd[,"xd"] <- na.locf(xtd[,"xd"])
xtd <- xtd[index(xt)]

edited Aug 07 '15 at 16:27

answered Aug 07 '15 at 11:20

Joshua Ulrich

173,410
32
338
418

Thanks - that is very easy. Though it does go wrong when `xt` already contains `NA` values that should be preserved (I just edited my question to show an example). – Darren Cook Aug 07 '15 at 13:46
2

@DarrenCook: That's easy to resolve. Just do the `na.locf` outside of the `merge` call. I'll edit my answer. – Joshua Ulrich Aug 07 '15 at 16:23

FXQuantTrader · Answer 2 · 2015-08-07T14:46:00.133

To get what you want, you want to merge the xd timestamps onto the xd time stamps exactly (to the subsecond level). So one approach is to find the first time stamp in xt on each calendar day in your timezone, and use that timestamp as the index value for the day in xd.

Under the assumption that you know the value of xd at the start of the trading day (at 00:00:00) you could do something like this (the code below wants minor tweaking if you only know the value of xd at the end of the trading day):

Sys.setenv(TZ = "UTC") 
library(xts)

set.seed(777)

library(lubridate)
xt = xts( data.frame(A=1:20,B=201:220,C=round(runif(20)*10,1)),
          order.by = as.POSIXct("2015-06-21") + (runif(20) * 86400 * 14) )

# Use consistent time index ordering (both POSIXct):
xd = xts( round(runif(14) - 0.5,1), as.POSIXct("2015-06-21") +  days(1:14))

# since xd elements are randomly created each time:
xd2 <- xd

# get first timestamp of each day in xt:
first_each_day <- .indexday(xt)
first_each_day_ndup <- !duplicated(first_each_day)
first_each_day_ndup.i <- which(first_each_day_ndup) # this row is the first for each day

xt_sub <- xt[first_each_day_ndup.i]

xt_sub_floor_dates <- floor_date(index(xt_sub), "day")
xd_date_eq_xt_date.i <- which(index(xd2) %in% xt_sub_floor_dates)
switch2.i <- which(xt_sub_floor_dates %in% index(xd2))


# Set xd time to the first timestamp in xt for the day, if it exists in xt:
xdtmp <- xd2[xd_date_eq_xt_date.i,]
index(xdtmp) <- index(xt_sub[switch2.i,])

# xts merge trick -- name new column at the same time as merging all in one statement:
res <- merge(xt, dailyvalue = drop(xdtmp))
res[, "dailyvalue"] <- na.locf(res[, "dailyvalue"])

Regarding your points:

If there is no row in xt that corresponds to a date in xd, a row is not added in xt with this approach (i.e. your point 3 is addressed).

You have not mentioned how you would like to handle NAs that might pre-exist in xd before merging xd and xt, but one way of being aware of NAs in xt after merging would be to set values in xd which are NA to a common sense unused numeric value like -Inf, so one can still use na.locf to fill the dailyvalue column appropriately in xt

# suppose NA exists in xd at row 7:

xd2[7,] <- NA

# Set a replacement dummy numeric value for recognising NAs in the dailyvalue column. e.g. Inf

xd2[is.na(xd2),] <- -Inf

# Now repeat the above code for merging:
xdtmp <- xd2[xd_date_eq_xt_date.i,]
index(xdtmp) <- index(xt_sub[switch2.i,])

# merge and name new column at the same time all in one statement:
res <- merge(xt, dailyvalue = drop(xdtmp))
res[, "dailyvalue"] <- na.locf(res[, "dailyvalue"])

# backfill NAs in dailyvale if necessary:
res[!is.finite(res[, "dailyvalue"]), "dailyvalue"] <- NA
print(res)

Finally, if you're missing dates in xd, but know how you would like to back fill their values, you could just add those unknown dates to xd before doing the above code to merge.

Thanks for the effort in that answer! But surely it shouldn't be that complicated? xts index handling is so sophisticated, surely it has something to help. BTW, using -9999 to temporarily mean NA is quite risky - you never know when code is going to be used for data with a different range to what you have seen up to now. (I want to say `-Inf` or `NaN` are safer than -9999, but they are just as vulnerable I suppose.) — Darren Cook, Aug 07 '15 at 08:33
On the point of a temporary NA replacement, you're right, Inf is a cleaner choice, and works with na.locf. Although NaN doesn't seem to work with na.locf. — FXQuantTrader, Aug 07 '15 at 14:50

How to merge a daily xts into a sparse time-indexed xts?

2 Answers2