
I've been looking to aggregate values present in different chunks in the xdf file, but I'm unable to get it to work.

Would any of you have a code snippet where you've used any apply function inside of a transform in an rxDataStep?

Arun Jose
  • No such package on CRAN. Where's it from? But in any case, what is the data structure in the xdf file? If we know that, we can suggest ways to use `*apply` functions. – Carl Witthoft Oct 31 '13 at 11:33
  • I don't think you can aggregate in rxDataStep. Could you provide an example of what you want to achieve as if your data were in regular data.frame? – Alex Vorobiev Oct 31 '13 at 19:41
  • 1
    @ Carl: the revoScaler Package, is part of the Revolution R Enterprise release, so you won't find it on CRAN. The .xdf file is the native external memory data format used by the package. You would use it if your data is larger than your memory. – Arun Jose Nov 03 '13 at 17:16
  • @Alex: My main issue is that when my data is imported as an xdf, it gets broken into multiple chunks. For example, I have product information and the associated inventory for each calendar day. My data isn't sorted by product, so one product's information is literally spread out over multiple chunks in the xdf. When I try to count the inventory stocked per product, per year, I don't get an aggregated result; I get counts split by each chunk. Any workaround for this? – Arun Jose Nov 03 '13 at 17:20
  • I am trying to figure out aggregations for xdf files myself. If all you need is counts, you can probably use the rxCube function: `rxCube(~ product:day, data=your.xdf)` (a sketch follows below). – Alex Vorobiev Jan 29 '14 at 16:31
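
Following up on the rxCube suggestion above, here is a minimal sketch, assuming the data sits in a file "inventory.xdf" with factor columns product and year (all of these names are illustrative, not from the question). rxCube tabulates counts across every chunk of the xdf file, so no per-chunk bookkeeping is needed for simple counts.

# Counts per product/year combination over the whole xdf file.
# Both columns are assumed to be factors; a numeric column can be
# treated as a factor on the fly by writing F(year) in the formula.
counts <- rxCube(~ product:year, data = "inventory.xdf")
counts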

1 Answer


Apply a transform function via the transformFunc argument. Any packages the function needs must be installed on the worker nodes and listed in transformPackages; helper functions or other objects can be passed to the transformFunc through transformObjects.

xformFunction <- function(data) {
  # dplyr is loaded on each worker via transformPackages below
  require(dplyr)
  # the chunk arrives as a list of vectors; convert it to a data frame
  df <- as.data.frame(data)
  # collapse the chunk to one row per distinct value of z
  # (this groups within the current chunk only)
  df <- dplyr::summarise(dplyr::group_by(df, z))
  return(df)
}

rxDataStep(inData = input_xdf, outFile = t_xdf,
           transformFunc = xformFunction,
           transformPackages = c("dplyr"),
           overwrite = TRUE)

Note that the aggregation happens per chunk on each node, so you can still get duplicate z values in the output, for example when using the Spark compute context; a second pass is needed to combine the partial results.
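
Here is a minimal sketch of that second pass, assuming the first pass also emits a per-chunk count column n and that the per-chunk output written to t_xdf is small enough to fit in memory (input_xdf, t_xdf, z, and n are illustrative names, not part of the original answer):

# First pass: count rows per z within each chunk and write the
# partial results to t_xdf ("n" is a hypothetical count column).
chunkCounts <- function(data) {
  df <- as.data.frame(data)
  dplyr::summarise(dplyr::group_by(df, z), n = dplyr::n())
}

rxDataStep(inData = input_xdf, outFile = t_xdf,
           transformFunc = chunkCounts,
           transformPackages = c("dplyr"),
           overwrite = TRUE)

# Second pass: the partial results are one row per z per chunk, so they
# are usually small enough to read back as a data frame and collapse locally.
partials <- rxDataStep(inData = t_xdf)  # returns a data frame when outFile is omitted
totals <- dplyr::summarise(dplyr::group_by(partials, z), n = sum(n))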

edeg