How can I preprocess new data before making a prediction with the mlr package, so that the information learned while preprocessing the original data is reused? For example, if I merge small factor levels and the level frequencies in the new data set differ from those in the original data set, the resulting factor levels may differ and prediction fails.

Note: I am assuming that the new data is not yet available when the model is trained. This is not about test data, but about predicting on new data. So how is preprocessing of new data supposed to be done in mlr?

Here is an example where I created a new task to preprocess the new data set, which leads to an error:

library(mlr)
a <- data.frame(y=factor(c(1,1,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(10,1,1)),
                stringsAsFactors = TRUE) # needed in R >= 4.0 so that x1 is a factor
# most frequent x1 level is "a"
aTask <- makeClassifTask(data = a, target = "y", positive="1")
aTask <- mergeSmallFactorLevels(aTask, cols=c("x1"), min.perc=0.1)
# combines "b" and "c" into the level ".merged"
getTaskData(aTask)

aLearner <- makeLearner("classif.rpart", predict.type = "prob")
model <- train(aLearner, aTask)

b <- data.frame(y=factor(c(1,0,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(1,10,1)),
                stringsAsFactors = TRUE) # needed in R >= 4.0 so that x1 is a factor
# most frequent x1 level is "b"
# the target is made up, because at this stage no target
# variable would be available
newdataTask <- makeClassifTask(data = b, target = "y", positive="1")
newdataTask <- mergeSmallFactorLevels(newdataTask, cols="x1", 
                                      min.perc = 0.1)
# combines "a" and "c" into the level ".merged"
getTaskData(newdataTask)

pred <- predict(model, newdataTask)

#Error in model.frame.default(Terms, newdata, na.action = na.action, 
#                              xlev = attr(object,  : 
#factor 'x1' has new levels b

Another problem with my solution is that a new task seems to require a target variable, which would not be available for new data sets.

tover

1 Answer


mlr doesn't offer anything to do this automatically, but you can easily check which factor levels have been replaced and rename accordingly in the new data:

library(plyr)
to.replace = setdiff(levels(b$x1), levels(getTaskData(aTask)$x1))
b$x1 = mapvalues(b$x1, from = to.replace, to = rep(".merged", times = length(to.replace)))
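
If you would rather not depend on plyr, the same renaming can be sketched in base R by assigning to the factor levels directly. The vector `train.levels` below is a made-up stand-in for `levels(getTaskData(aTask)$x1)`, and `x` is toy data, so treat this as an illustration of the idea rather than a drop-in replacement:

```r
# Toy factor standing in for a column of the new data
x = factor(c("a", "b", "c", "b"))

# Assumed levels that remained after merging on the training data
train.levels = c("a", ".merged")

# Every level the training data does not know is renamed to ".merged";
# assigning the same name to several levels merges them into one level
to.replace = setdiff(levels(x), train.levels)
levels(x)[levels(x) %in% to.replace] = ".merged"

print(x)
```

Assigning duplicate names through `levels<-` collapses them into a single level, which is why no explicit merge step is needed.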

Complete example:

library(mlr)
a = data.frame(y=factor(c(1,1,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(10,1,1)),
                stringsAsFactors = TRUE) # needed in R >= 4.0 so that x1 is a factor
aTask = makeClassifTask(data = a, target = "y", positive="1")
aTask = mergeSmallFactorLevels(aTask, cols=c("x1"), min.perc=0.1)

aLearner = makeLearner("classif.rpart", predict.type = "prob")
model = train(aLearner, aTask)

b = data.frame(y=factor(c(1,0,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(1,10,1)),
                stringsAsFactors = TRUE) # needed in R >= 4.0 so that x1 is a factor
library(plyr)
to.replace = setdiff(levels(b$x1), levels(getTaskData(aTask)$x1))
b$x1 = mapvalues(b$x1, from = to.replace, to = rep(".merged", times = length(to.replace)))

newdataTask = makeClassifTask(data = b, target = "y", positive="1")

pred = predict(model, newdataTask)
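
Regarding the missing target for new data: `predict` on an mlr model also accepts a plain data frame through its `newdata` argument, so a task (and therefore a made-up target) should not be needed. A minimal sketch under that assumption, reusing the data from the question but leaving out `y` in the new data:

```r
library(mlr)
library(plyr)

a = data.frame(y = factor(c(1,1,1,1,1,1,1,1,0,0,1,0)),
               x1 = factor(rep(c("a", "b", "c"), times = c(10, 1, 1))))
aTask = makeClassifTask(data = a, target = "y", positive = "1")
aTask = mergeSmallFactorLevels(aTask, cols = "x1", min.perc = 0.1)
model = train(makeLearner("classif.rpart", predict.type = "prob"), aTask)

# New data without a target column
b = data.frame(x1 = factor(rep(c("a", "b", "c"), times = c(1, 10, 1))))

# Rename the levels that were merged away during training
to.replace = setdiff(levels(b$x1), levels(getTaskData(aTask)$x1))
b$x1 = mapvalues(b$x1, from = to.replace, to = rep(".merged", times = length(to.replace)))

# Pass the data frame directly instead of building a task
pred = predict(model, newdata = b)
head(as.data.frame(pred))
```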

For things like this it's often better to fuse a learner with the preprocessing so that this happens transparently and automatically when you train and predict. In this case, a complete example would look something like this:

lrn = makeLearner("classif.rpart", predict.type = "prob")
# Training step: merge small levels and remember the resulting levels of x1.
trainfun = function(data, target, args) {
    task = makeClassifTask(data = data, target = target, positive = "1")
    new.task = mergeSmallFactorLevels(task, cols = c("x1"), min.perc = 0.1)
    return(list(data = getTaskData(new.task), control = list(levels(getTaskData(new.task)$x1))))
}
# Prediction step: map every level not present after training to ".merged".
predictfun = function(data, target, args, control) {
    library(plyr)
    to.replace = setdiff(levels(data$x1), control[[1]])
    data$x1 = mapvalues(data$x1, from = to.replace, to = rep(".merged", times = length(to.replace)))
    return(data)
}
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun)

a = data.frame(y=factor(c(1,1,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(10,1,1)),
                stringsAsFactors = TRUE) # needed in R >= 4.0 so that x1 is a factor
aTask = makeClassifTask(data = a, target = "y", positive="1")
model = train(lrn, aTask)

b = data.frame(y=factor(c(1,0,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(1,10,1)),
                stringsAsFactors = TRUE) # needed in R >= 4.0 so that x1 is a factor
newdataTask = makeClassifTask(data = b, target = "y", positive = "1")
pred = predict(model, newdataTask)

This is only a proof of concept -- you'd probably want to have arguments for specifying which features should be processed and what the threshold should be, and adapt the predictfun code to handle an arbitrary number of processed features.
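
One way such a generalization might look is sketched below. The helper `makeMergeLevelsWrapper` and its arguments `cols`, `min.perc` and `new.level` are hypothetical names invented for this illustration, not part of mlr. The merging is done in base R instead of through `mergeSmallFactorLevels`, so the exact threshold behaviour may differ slightly from mlr's.

```r
library(mlr)

# Hypothetical helper (not an mlr function): wraps `learner` so that rare
# levels of the factor columns in `cols` are merged during training and
# unknown levels are mapped the same way at prediction time.
makeMergeLevelsWrapper = function(learner, cols, min.perc = 0.1, new.level = ".merged") {
  trainfun = function(data, target, args) {
    train.levels = list()
    for (cn in cols) {
      x = as.factor(data[[cn]])
      freq = table(x) / length(x)
      small = names(freq)[freq < min.perc]
      # Rename all rare levels to new.level; duplicate names are merged
      levels(x)[levels(x) %in% small] = new.level
      data[[cn]] = x
      train.levels[[cn]] = levels(x)
    }
    list(data = data, control = list(levels = train.levels))
  }
  predictfun = function(data, target, args, control) {
    for (cn in names(control$levels)) {
      x = as.factor(data[[cn]])
      known = control$levels[[cn]]
      levels(x)[!(levels(x) %in% known)] = new.level
      # Use exactly the training levels; if nothing was merged during
      # training, unknown levels become NA here
      data[[cn]] = factor(as.character(x), levels = known)
    }
    data
  }
  makePreprocWrapper(learner, train = trainfun, predict = predictfun)
}

a = data.frame(y = factor(c(1,1,1,1,1,1,1,1,0,0,1,0)),
               x1 = factor(rep(c("a", "b", "c"), times = c(10, 1, 1))))
lrn = makeMergeLevelsWrapper(makeLearner("classif.rpart", predict.type = "prob"),
                             cols = "x1", min.perc = 0.1)
model = train(lrn, makeClassifTask(data = a, target = "y", positive = "1"))

b = data.frame(x1 = factor(rep(c("a", "b", "c"), times = c(1, 10, 1))))
pred = predict(model, newdata = b)
```

The settings are captured in a closure here to keep the sketch short. `makePreprocWrapper` also takes `par.set` and `par.vals` arguments, which would be the route to take if the threshold should be tunable.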

Lars Kotthoff