
The function below takes a folder of CSV files (each file is a financial time series with datetime, open, high, low, close columns) and builds a single XTS object for each of the open, high, low and close prices, where each column is an individual security. For my use case this representation allows much more convenient and faster processing than keeping a separate XTS object per file.

require(quantmod)

LoadUniverseToEnv <- function(srcDir, env) {
  fileList <- list.files(srcDir)
  if (length(fileList) == 0)
    stop("No files found!")

  env$op <- NULL
  env$hi <- NULL
  env$lo <- NULL
  env$cl <- NULL
  cols <- NULL

  for (file in fileList) {
    filePath <- file.path(srcDir, file)
    if (!file.info(filePath)$isdir) {
      # read one CSV (datetime, open, high, low, close) as an xts object
      x <- as.xts(read.zoo(filePath, header=TRUE, sep=",", tz=""))
      # column name = file name up to the first underscore
      cols <- c(sub("_.*", "", file), cols)
      # outer join: prepend this security as a new column to each price object
      env$op <- merge(Op(x), env$op)
      env$hi <- merge(Hi(x), env$hi)
      env$lo <- merge(Lo(x), env$lo)
      env$cl <- merge(Cl(x), env$cl)
      cat(sprintf("%s : added: %s from: %s to: %s\n", as.character(Sys.time()), file, start(x), end(x)))
    }
  }
  colnames(env$op) <- cols
  colnames(env$hi) <- cols
  colnames(env$lo) <- cols
  colnames(env$cl) <- cols
}
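
For reference, a minimal usage sketch (the directory name and the file naming convention here are just assumptions for illustration):

# hypothetical usage: CSVs named like "AAPL_daily.csv" under ./data
universe <- new.env()
LoadUniverseToEnv("./data", universe)
head(universe$cl)  # wide close-price XTS, one column per security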

Performance is fine for a limited number of files, but each merge slows linearly with the width of the XTS objects, so the total load time grows quadratically and becomes a problem for large datasets. The bottleneck is CPU during the merge, when a new column is appended to each of the four objects (e.g. around 100ms initially, slowing by roughly 1ms per additional column).

Since it's CPU bound, my first thought is to parallelize by merging n batches of files and then merging the results, but I'm wondering if there's a better way.


1 Answer


The best solution I found for this was to merge in "chunks". For example, with 100 columns, first merge into 10 intermediate XTS objects of 10 columns each and then merge those 10 objects; this dramatically improves performance.

The example below shows roughly a 15x (1500%) improvement when merging 2000 xts objects with 1000 rows each and identical indexes.

Example:

require(xts)
require(foreach)

nCols <- 2000
nRows <- 1000

x <- xts(runif(nRows), order.by=as.Date(1:nRows, origin="1970-01-01"))
xList <- list()
for (i in 1:nCols)
  xList[[i]] <- x

testA <- function() {
  # naive approach: grow the merged object one column at a time
  merged <- NULL
  for (x in xList)
    merged <- merge(x, merged)
  colnames(merged) <- 1:length(xList)
  merged
}

testB <- function() {
  # chunked approach: merge sqrt(n) chunks of ~sqrt(n) columns, then merge the chunks
  n <- length(xList)
  nChunks <- floor(sqrt(n))
  idx <- split(1:n, sort(1:n %% nChunks))

  merged <- foreach (chunk = 1:nChunks, .combine = "merge") %do% {
    merged <- foreach (i = idx[[chunk]], .combine = "merge") %do% {
      xList[[i]]
    }
    merged
  }
  colnames(merged) <- 1:n
  merged
}

print("Test A")
print(system.time(resultA <- testA()))
print("Test B")
print(system.time(resultB <- testB()))
print(sprintf("Identical : %s", identical(resultA, resultB)))
print(sprintf("Dimensions: %dx%d", ncol(resultA), nrow(resultA)))

Output:

[1] "Test A"
   user  system elapsed
  33.12    3.18   36.30
[1] "Test B"
   user  system elapsed
   2.28    0.01    2.31
[1] "Identical : TRUE"
[1] "Dimensions: 2000x1000"

Note that the foreach loops are not running in parallel here (%do%, not %dopar%).
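
If you do want to push the chunked merge onto multiple cores, a minimal sketch using a doParallel backend might look something like the following (testC, the registerDoParallel call and the core count are my own assumptions and not benchmarked):

require(doParallel)
registerDoParallel(cores = 4)  # assumption: 4 local cores available

testC <- function() {
  n <- length(xList)
  nChunks <- floor(sqrt(n))
  idx <- split(1:n, sort(1:n %% nChunks))

  # merge each chunk of columns on a worker...
  chunks <- foreach(chunk = 1:nChunks, .packages = "xts", .export = "xList") %dopar% {
    do.call(merge, xList[idx[[chunk]]])
  }
  # ...then merge the much wider chunk results in the main process
  merged <- do.call(merge, chunks)
  colnames(merged) <- 1:n
  merged
}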
