
For Revolution R Enterprise users: is there a way to apply a function to each factor level of an .xdf, in, say, rxCube()? I know transforms let you operate on the data before tabulation, but it seems to me you can only get count, sum, and mean.

For example, I want to find the row that has the minimum value of a particular variable, conditional on industry * year.

The only solution I can think of is to rxSplit() the data, sort by the variables you want, and then do what you will. I am sure the reason one can't do this is that there are too many integrity conditions, or that the supported tabulation functions are optimized in C and a user-supplied function would be more complicated and terribly slow.

It would be amazing to basically have an out-of-memory data.table.

Andrie
grad student

1 Answer


What you describe is not easily doable with a single function from RevoScaleR. The rxSplit approach you describe is one way. Here is a comparison of its results with those of an in-memory aggregate, to show they are the same.

set.seed(1234)
myData <- data.frame(year = factor(sample(2000:2015, size = 100, replace = TRUE)),
                     x = rnorm(100))
xdfFile <- rxDataStep(inData = myData, outFile = "test.xdf", rowsPerRead = 10)

newDir <- file.path(getwd(), "splits")
dir.create(newDir)
splitFiles <- rxSplit(inData = xdfFile, 
                      outFilesBase = paste0(newDir, "/", sub("\\.xdf$", "",
                                            basename(xdfFile@file))), 
                      splitByFactor = "year")

minFun <- function(xdf) {
  dat <- rxDataStep(inData = xdf, reportProgress = 0)
  data.frame(year = dat$year[1], minPos = which.min(dat$x))
}
minPos <- do.call(rbind, lapply(splitFiles, minFun))
row.names(minPos) <- NULL

minPos
aggregate(x ~ year, data = myData, FUN = which.min)

The above does assume that the data in each group can fit into RAM. If that is not the case, some tweaking would be required.
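One possible form of that tweaking, sketched here in base R only: walk through the data in fixed-size chunks and keep a running "best row so far" for each year, so that only one chunk plus one row per group is ever held in memory. The chunking below merely simulates reading an .xdf block by block, and `chunkSize`, `best`, and `candidates` are hypothetical names for illustration, not part of RevoScaleR. Note that it reports the row number in the full data set rather than the position within each group.

```r
set.seed(1234)
myData <- data.frame(year = factor(sample(2000:2015, size = 100, replace = TRUE)),
                     x = rnorm(100))

chunkSize <- 10  # assumption: mirrors rowsPerRead = 10 above
starts <- seq(1, nrow(myData), by = chunkSize)

# Running result: for each year, the smallest x seen so far and its row number
best <- data.frame(year = character(0), x = numeric(0), row = integer(0),
                   stringsAsFactors = FALSE)

for (s in starts) {
  rows <- s:min(s + chunkSize - 1, nrow(myData))
  # Stand-in for one block read from disk
  chunk <- data.frame(year = as.character(myData$year[rows]),
                      x = myData$x[rows],
                      row = rows,
                      stringsAsFactors = FALSE)
  # Combine the running result with the new chunk and keep one row per year
  candidates <- rbind(best, chunk)
  candidates <- candidates[order(candidates$year, candidates$x), ]
  best <- candidates[!duplicated(candidates$year), ]
}
row.names(best) <- NULL

best
```

Because `best` never grows beyond one row per year, the same pattern should carry over to any block-wise reader, provided the number of groups itself fits in memory.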

There is one other solution given the assumption that the individual groups can fit into RAM, and that is the use of the RevoPemaR package.

library("RevoPemaR")

rxSort(inData = xdfFile, outFile = xdfFile, sortByVars = "year", overwrite = TRUE)

byGroupPemaObj <- PemaByGroup()
minByYear <- pemaCompute(pemaObj = byGroupPemaObj, data = xdfFile, 
                       groupByVar = "year", computeVars = "x", 
                       fnList = list(
                         minPos = list(FUN = which.min, x = NULL)))

minByYear
  • That's not really the same. Finding the min value is not the same as finding the row where the min value lies. From your code, it appears that the obvious problem of rxDataStep is that it can only evaluate "minFun" within the scope of the chunk. The answer could only come if one were able to collapse via rxSummary, where each chunk was *guaranteed* to be the subset you wanted it to be. The only way to do that, it seems to me, is to rxSplit the file, but I don't see why Revolution Analytics can't use indexing once rxFactors is applied to evaluate chunks just inside the factor. – grad student Feb 15 '15 at 05:59
  • From a glance, it seems you are from the Revo R team. I love your product! I just was suggesting something that seemed to make sense to me. – grad student Feb 15 '15 at 06:03
  • @APK, that makes sense, and may well be the case one day, but for now what you describe with `rxSplit` will work. I misread your question, so I will edit the answer to give the row of the min you are looking for. – Derek McCrae Norton Feb 17 '15 at 17:09
  • @APK does this satisfy your requirement? – Derek McCrae Norton Feb 20 '15 at 17:43
  • Did not know about RevoPemaR. This is, if I understand what it's doing, unbelievable! Thanks! – grad student Feb 21 '15 at 22:54
  • @APK does my answer solve the question you asked, or is there further clarification needed? RevoPemaR is pretty cool. – Derek McCrae Norton Feb 26 '15 at 21:18
  • Hi, @APK If you find an answer useful, please consider marking the answer as accepted by ticking the green arrow. See http://stackoverflow.com/help/accepted-answer – Andrie Feb 27 '15 at 20:46