The function below takes a folder of CSV files (each file is a financial time series with datetime, open, high, low, close columns) and builds a single XTS object for each of the open, high, low and close prices, with one column per security. For my use case this representation is much more convenient and faster to process than keeping one XTS object per file.
require(quantmod)

LoadUniverseToEnv <- function(srcDir, env) {
  fileList <- list.files(srcDir)
  if (length(fileList) == 0)
    stop("No files found!")

  env$op <- NULL
  env$hi <- NULL
  env$lo <- NULL
  env$cl <- NULL
  cols <- NULL

  for (file in fileList) {
    filePath <- file.path(srcDir, file)
    if (!file.info(filePath)$isdir) {
      x <- as.xts(read.zoo(filePath, header = TRUE, sep = ",", tz = ""))
      # security name is the part of the file name before the first "_";
      # names are prepended to match the column order produced by merge() below
      cols <- c(sub("_.*", "", file), cols)
      # outer join: append this file's series as a new column on each object
      env$op <- merge(Op(x), env$op)
      env$hi <- merge(Hi(x), env$hi)
      env$lo <- merge(Lo(x), env$lo)
      env$cl <- merge(Cl(x), env$cl)
      cat(sprintf("%s : added: %s from: %s to: %s\n",
                  as.character(Sys.time()), file, start(x), end(x)))
    }
  }

  colnames(env$op) <- cols
  colnames(env$hi) <- cols
  colnames(env$lo) <- cols
  colnames(env$cl) <- cols
}
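For context, I call it like this (the directory path is just an example):

env <- new.env()
LoadUniverseToEnv("~/data/universe", env)
head(env$cl)   # closing prices, one column per security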
Performance is fine for a small number of files, but each merge slows linearly with the current width of the XTS objects, so the total load time grows roughly quadratically with the number of files and becomes a problem for large datasets. The bottleneck is CPU during the merge, when the new column is appended to each of the four objects (e.g. about 100ms per file initially, slowing by roughly 1ms per additional column).
Since it's CPU bound, my first thought is to parallelize by merging n batches of files in parallel and then merging the batch results, roughly as sketched below, but I'm wondering if there's a better way.
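Roughly what I have in mind (a sketch only, untested, shown for the open prices; nBatches, the helper names and the use of parallel::mclapply are placeholder choices, and hi/lo/cl would be handled the same way):

require(quantmod)
library(parallel)

LoadOpBatched <- function(srcDir, nBatches = 4) {
  files <- list.files(srcDir, full.names = TRUE)
  files <- files[!file.info(files)$isdir]

  # split the file list into nBatches roughly equal groups
  batches <- split(files, cut(seq_along(files), nBatches, labels = FALSE))

  # build one partial "op" object per batch, using the same iterative merge as above
  mergeBatch <- function(batch) {
    op <- NULL
    for (f in batch) {
      x <- as.xts(read.zoo(f, header = TRUE, sep = ",", tz = ""))
      op <- merge(Op(x), op)   # outer join, new column prepended
    }
    # columns end up in reverse file order, so reverse the names to match
    colnames(op) <- rev(sub("_.*", "", basename(batch)))
    op
  }

  # fork one worker per batch (mclapply forks, so on Windows this would need parLapply instead)
  partial <- mclapply(batches, mergeBatch, mc.cores = nBatches)

  # final outer join of the few wide batch results
  do.call(merge, unname(partial))
}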