
I'm working on subsets of data from multiple time periods and I'd like to do column and level reduction on my training set and then apply the same actions to other datasets of the same structure.

I've been using dataframeReduce from the Hmisc package, but applying the function to different datasets results in slightly different actions on each.

trainPredictors<-dataframeReduce(trainPredictors, 
                  fracmiss=0.2, maxlevels=20,  minprev=0.075)
testPredictors<-dataframeReduce(testPredictors, 
                  fracmiss=0.2, maxlevels=20,  minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]

The final line errors because testPredictors has had a column removed that trainPredictors retains. Every other dataset should receive exactly the transformations that were applied to trainPredictors.

Does anyone know how to apply the same cleanup actions to multiple datasets either using dataframeReduce or another function/block of code?




An example

Using the function NAins from http://trinkerrstuff.wordpress.com/2012/05/02/function-to-generate-a-random-data-set/

NAins <-  NAinsert <- function(df, prop = .1){
  n <- nrow(df)
  m <- ncol(df)
  num.to.na <- ceiling(prop*n*m)
  id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
  rows <- id %/% m + 1
  cols <- id %% m + 1
  sapply(seq(num.to.na), function(x){
    df[rows[x], cols[x]] <<- NA
  }
  )
  return(df)
}
library("Hmisc")
trainPredictors<-NAins(mtcars, .1) 
testPredictors<-NAins(mtcars, .3)
trainPredictors<-dataframeReduce(trainPredictors, 
                                 fracmiss=0.2, maxlevels=20,  minprev=0.075)
testPredictors<-dataframeReduce(testPredictors, 
                                fracmiss=0.2, maxlevels=20,  minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]
Steph Locke
  • It's not possible to tell which of the several conditions being tested is triggering the removal if you offer no data. – IRTFM May 30 '13 at 17:09
  • I didn't post a reproducible example as I was aiming towards the overall style/pattern of solution that ought to be applied in the situation. So something more like `[CODE X for cleaning initial dataset];[CODE Y for capturing changes performed by CODE X];[CODE Z for applying changes identified in CODE Y]` rather than making it specific to a dataset. I will however work up an example. – Steph Locke May 31 '13 at 08:19
  • @DWin please let me know if the reproducible example is sufficient – Steph Locke May 31 '13 at 14:09
  • It's clear why all of your testPredictors columns in the example are being removed. They have an NA proportion of 0.3 and you are removing one with a proportion greater than 0.2. It occurs to me that you could just use `testPredictors <- testPredictors [names(trainPredictors)]` and skip the "NA-reduction" on the test set. – IRTFM May 31 '13 at 16:03
  • @DWin - yes it is clear, however, I would like to be able to perform **all** the actions determined by the first execution, including all the merging of categorical levels into an OTHER category, after it has been determined based on dataset trainPredictors. – Steph Locke Jun 03 '13 at 08:15

1 Answer


If your goal is to have the same variables with the same levels, then you need to avoid running dataframeReduce a second time. Instead, select the columns that dataframeReduce kept for the training set, and apply factor-reduction logic to the test set so that it matches the training set as closely as the subsequent comparison operations require. If a predict operation is planned, the levels must be identical, and you will need to modify the code in dataframeReduce that works on the levels:

    if (is.category(x) || length(unique(x)) == 2) {
        tab <- table(x)
        if ((min(tab)/n) < minprev) {
            if (is.category(x)) {
              x <- combine.levels(x, minlev = minprev)
              s <- "grouped categories"
              if (length(levels(x)) < 2) 
                s <- paste("prevalence<", minprev, sep = "")
            }
            else s <- paste("prevalence<", minprev, sep = "")
        }
    }

So a better problem statement is likely to produce a better strategy. This will probably require knowing which levels occur in the full dataset and in the train and test sets, as well as which tests or predictions are planned (not yet stated).
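As a rough illustration of the "learn on train, replay on test" pattern, here is a minimal base-R sketch. The helpers `learn_reduction` and `apply_reduction` are hypothetical names, not part of Hmisc, and they only approximate two of the rules in dataframeReduce (the `fracmiss` column filter and the `minprev` level grouping); the `maxlevels` rule and the low-prevalence check for binary variables are left out. The "OTHER" label is assumed to match what combine.levels produces.

```r
# Hypothetical helpers (not Hmisc functions): record which columns and factor
# levels survive on the training data, then replay that on other data frames.
learn_reduction <- function(df, fracmiss = 0.2, minprev = 0.075) {
  # Keep columns whose fraction of missing values does not exceed fracmiss.
  ok <- vapply(df, function(x) mean(is.na(x)) <= fracmiss, logical(1))
  keep <- names(df)[unname(ok)]
  # For each kept factor, record the levels with prevalence >= minprev.
  # Non-factor columns get NULL, meaning "leave unchanged".
  lv <- lapply(df[keep], function(x) {
    if (!is.factor(x)) return(NULL)
    p <- prop.table(table(x))
    names(p)[as.vector(p >= minprev)]
  })
  list(columns = keep, levels = lv)
}

apply_reduction <- function(df, recipe, other = "OTHER") {
  # Select exactly the training columns, in the training order.
  out <- df[, recipe$columns, drop = FALSE]
  for (nm in recipe$columns) {
    kept <- recipe$levels[[nm]]
    if (is.null(kept)) next
    x <- as.character(out[[nm]])
    # Any level not retained on the training set is merged into `other`.
    x[!is.na(x) & !(x %in% kept)] <- other
    out[[nm]] <- factor(x, levels = c(kept, other))
  }
  out
}

# Toy data, made up for illustration only.
train <- data.frame(
  a = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),
  b = c(NA, NA, NA, 4, 5, 6, 7, 8, 9, 10),
  g = factor(c("x", "x", "x", "x", "y", "y", "y", "y", "y", "z"))
)
test <- data.frame(
  a = c(NA, NA, 3),
  b = c(1, 2, 3),
  g = factor(c("x", "z", "w"))
)

recipe    <- learn_reduction(train, fracmiss = 0.2, minprev = 0.15)
train_red <- apply_reduction(train, recipe)
test_red  <- apply_reduction(test, recipe)
```

Because the decisions are stored in `recipe`, the test set keeps column `a` even though most of its values are missing there, and its unseen level `w` is folded into the same catch-all level as the rare training level `z`. Whether that is the right behaviour depends on the model you plan to fit.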

IRTFM