I'm working on subsets of data from multiple time periods and I'd like to do column and level reduction on my training set and then apply the same actions to other datasets of the same structure.
I've been using dataframeReduce from the Hmisc package, but applying the function separately to different datasets results in slightly different actions on each.
trainPredictors <- dataframeReduce(trainPredictors,
    fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors <- dataframeReduce(testPredictors,
    fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors <- testPredictors[,names(trainPredictors)]
The final line ends up erroring because testPredictors has had a column removed that trainPredictors retains. Every other set should have exactly the transformations that were applied to trainPredictors applied to it.
Does anyone know how to apply the same cleanup actions to multiple datasets either using dataframeReduce or another function/block of code?
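To make the goal concrete, here is a minimal sketch of the behaviour I'm after, assuming it is enough to decide which columns to keep on the training set alone and then reuse that list everywhere else. It uses base R only, so it mimics just the fracmiss rule of dataframeReduce (not maxlevels or minprev). The helper names keepColumns and applyKeep and the small data frames are made up for illustration.

```r
# Hypothetical helper: names of the columns whose fraction of NAs
# is at most fracmiss. Only the fracmiss rule is mimicked here;
# maxlevels and minprev are not handled.
keepColumns <- function(df, fracmiss = 0.2) {
  missFrac <- colMeans(is.na(df))
  names(df)[missFrac <= fracmiss]
}

# Hypothetical helper: subset any dataset to the columns chosen on train.
applyKeep <- function(df, keep) {
  df[, keep, drop = FALSE]
}

# Small made-up data so the sketch runs on its own.
train <- data.frame(a = c(1, 2, 3, 4, 5),
                    b = c(NA, NA, 3, 4, 5),
                    c = c(1, 2, 3, 4, 5))
test  <- data.frame(a = c(NA, NA, NA, 4, 5),
                    b = c(1, 2, 3, 4, 5),
                    c = c(1, 2, 3, 4, 5))

keep <- keepColumns(train, fracmiss = 0.2)  # decided once, on train only
trainReduced <- applyKeep(train, keep)
testReduced  <- applyKeep(test, keep)       # same columns, even though test has more NAs
```

The same keep vector could then be reused on any further dataset. Carrying over the level merging done by minprev would presumably also need the merged level definitions from the training set to be stored, which this sketch does not attempt.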
A reproducible example, using the function NAins from http://trinkerrstuff.wordpress.com/2012/05/02/function-to-generate-a-random-data-set/ to insert missing values at random:
NAins <- NAinsert <- function(df, prop = .1){
    n <- nrow(df)
    m <- ncol(df)
    num.to.na <- ceiling(prop*n*m)
    id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
    rows <- id %/% m + 1
    cols <- id %% m + 1
    sapply(seq(num.to.na), function(x){
            df[rows[x], cols[x]] <<- NA
        }
    )
    return(df)
}
library("Hmisc")
trainPredictors <- NAins(mtcars, .1)
testPredictors <- NAins(mtcars, .3)
trainPredictors <- dataframeReduce(trainPredictors,
    fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors <- dataframeReduce(testPredictors,
    fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors <- testPredictors[,names(trainPredictors)]