
For example:

require(RevoScaleR)

# Create a data frame
set.seed(100)
myData = data.frame(x = 1:100, y = rep(c("a", "b", "c", "d"), 25),
                     z = rnorm(100), w = runif(100))

# Create a multi-block .xdf file from the data frame
inputFile = file.path(tempdir(), "testInput.xdf")
rxDataStep(inData = myData, outFile = inputFile, rowsPerRead = 50, 
           overwrite = TRUE)

# Square the values in the column "z"; this works fine
rxDataStep(inData = inputFile, outFile = inputFile, overwrite = TRUE,
           transforms = list(z = z^2))

# Define a squaring function and try to use it to repeat the previous step:
myFun = function(x) x^2
rxDataStep(inData = inputFile, outFile = inputFile, overwrite = TRUE,
           transforms = list(z = myFun(z)))

The final step fails with the error:

Error in transformation function: Error in eval(expr, envir, enclos) : could not find function "myFun"

The documentation for rxDataStep states that "As with all expressions, transforms ... can be defined outside of the function call using the expression function." But I have no idea how to implement this advice, and can't find an example. For instance, the following does not work:

myFun = expression(function(x) x^2)
rxDataStep(inData = inputFile, outFile = inputFile, overwrite = TRUE,
           transforms = list(z = myFun(z)))
zkurtz

2 Answers

2

You can certainly pass the transforms argument an expression that was created outside of the function call.

It would look something like this:

myFun <- expression(
  list(x2 = x^2,
       z2 = z^2))
rxDataStep(inData = inputFile, outFile = inputFile, overwrite = TRUE,
           transforms = myFun)

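If you want to pass a function instead, as in your first example, use transformFunc. The function takes the data as a list of columns and returns the modified list, so it would look something like this: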

myFun2 <- function(dataList){
  dataList$x2 <- dataList$x^2
  dataList$z2 <- dataList$z^2
  dataList
}
rxDataStep(inData = inputFile, outFile = inputFile, overwrite = TRUE,
           transformFunc = myFun2)
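
To make the contract of such a function easier to see, here is a minimal base-R sketch that calls a function of the same shape on a hand-built list. It assumes that each block of the file arrives as a named list of column vectors; the names `addSquares` and `chunk` are made up for illustration, and nothing here calls RevoScaleR.

```r
# Same shape as myFun2 above: takes a named list of columns, returns a named list.
addSquares <- function(dataList) {
  dataList$x2 <- dataList$x^2
  dataList$z2 <- dataList$z^2
  dataList
}

# Hypothetical stand-in for one 3-row block of the file (assumption, not real rxDataStep input).
chunk <- list(x = 1:3, z = c(0.5, -1, 2))

# The original columns are kept and the two new ones are appended.
out <- addSquares(chunk)
str(out)
```

Because the function only reads from and writes to the list it is given, it does not depend on anything else in the calling session.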
1

No idea why this works!

env <- new.env()
env$myFun <- function(x) x^2
rxDataStep(inData = inputFile, outFile = inputFile, overwrite = TRUE,
           transforms = list(z = myFun(z)), transformEnvir = env)
zkurtz
  • Basically, `transforms` only has access to the variables in `inputFile` and any R objects that you pass to it. When you set `transformEnvir`, everything in that environment (i.e. `myFun`) becomes available to `transforms`. Derek's answer uses `transformFunc` to achieve the same result. This confused me for a while until I realized that Revolution R is meant for working in a distributed context, and those nodes won't have access to anything in your R session unless you send it to them. Hence `transformFunc`, `transformObjects`, `transformEnvir`, etc. – Matt Parker Jun 16 '15 at 18:10
  • The alternative would be to make all of the objects in your R session available to all of your nodes... which could really slow things down. – Matt Parker Jun 16 '15 at 18:12
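
The scoping behaviour described in these comments can be imitated in plain R. The sketch below is only a rough analogy and not how RevoScaleR is implemented: it evaluates the call `myFun(z)` against a list of columns, using an enclosing environment that deliberately cannot see the global workspace. The names `chunk` and `isolated` are hypothetical.

```r
# Base-R analogy only; this does not reproduce RevoScaleR internals.
chunk <- list(z = c(1, 2, 3))   # stands in for the columns of one block of data

myFun <- function(x) x^2        # defined in the session, as in the question

# An environment with an empty parent: lookups cannot fall through to the
# session workspace (a crude stand-in for a node that cannot see your session).
isolated <- new.env(parent = emptyenv())

# First attempt: myFun is not in the data and not in the enclosing environment,
# so the call fails; the error message is captured instead of stopping the script.
res1 <- tryCatch(eval(quote(myFun(z)), chunk, isolated),
                 error = function(e) conditionMessage(e))

# Second attempt: place the function in the environment explicitly, which is
# what passing an environment with myFun in it accomplishes in the answer above.
isolated$myFun <- myFun
res2 <- eval(quote(myFun(z)), chunk, isolated)
```

Here `res1` holds an error message complaining that `myFun` cannot be found, while `res2` holds the squared values. The empty parent environment is an exaggeration chosen to make the effect visible; the real evaluation context still provides the usual base functions.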