I have just started using the R package mlr, and I am wondering whether I can define the training set and test set myself. For example, all observations in a time sequence except the last form the training set, and the last observation is the test set.
Here is my example:
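library(mlr)
library(survival)
library(dplyr)  # provides %>% and select()
data(lung)
myData2 <- lung %>%
  select(time, status, age)
myData2$status <- (myData2$status == 2)  # TRUE = event observed
myTrain <- seq_len(nrow(myData2) - 1)    # all rows except the last
myTest <- nrow(myData2)                  # the last row only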
The lung data is from the survival package. I use only three columns: time, status and age. Now, let's suppose they do not describe patients' ages and survival times. Let's say this is the ink purchase history of one customer.
age=74 means the customer bought 74 bottles of ink on that day, and time=306 means the customer ran out of ink after 306 days. I want to build a survival model using all the data except the last row. Then, when I have the data of the last row (age=58, meaning the customer bought 58 bottles of ink on that day), I can predict time. A number close to 177 would be a good estimate. So my training set and test set are fixed and do not need to be resampled.
In addition, I need to vary the hyperparameters and compare the results. Here is my code:
surv.task <- makeSurvTask(data = myData2, target = c('time', 'status'))
surv.lrn <- makeLearner("surv.cforest")
# Grid of hyperparameter values to compare
ps <- makeParamSet(
  makeDiscreteParam('mincriterion', values = c(1.281552, 2, 3)),
  makeDiscreteParam('ntree', values = c(100, 200, 300))
)
ctrl <- makeTuneControlGrid()
# split = 1 keeps every row in the training part of the holdout
rdesc <- makeResampleDesc('Holdout', split = 1, predict = 'train')
lrn <- makeTuneWrapper(surv.lrn, control = ctrl, resampling = rdesc, par.set = ps,
                       measures = list(setAggregation(cindex, train.mean)))
mod <- train(learner = lrn, task = surv.task, subset = myTrain)
surv.pred <- predict(mod, task = surv.task, subset = myTest)
surv.pred
You can see that I use split=1 in makeResampleDesc because I have a fixed training set that does not need to be resampled. The measures argument in makeTuneWrapper is currently not meaningful to me, as I need to define my own measures. Because of the fixed data split, I cannot use functions like resample or tuneParams to get an evaluation on the test data for different hyperparameters.
So, my question is: when the training set and test set are fixed, can mlr provide a comprehensive comparison across all hyperparameter settings? If so, how?
Incidentally, it looks like there is a function makeFixedHoldoutInstance that might do this, but I do not know how to use it. For example, I used makeFixedHoldoutInstance in the following way and got this error:
> f <- makeFixedHoldoutInstance(train.inds=myTrain,test.inds=myTest,size=length(myTrain)+1)
> lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=f,par.set=ps)
> resample(learner=lrn,task=surv.task,resampling=f)
[Resample] holdout iter 1: [Tune] Started tuning learner surv.cforest for parameter set:
                 Type len Def       Constr Req Tunable Trafo
mincriterion discrete   -   - 1.281552,2,3   -    TRUE     -
ntree        discrete   -   -  100,200,300   -    TRUE     -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mincriterion=1.281552; ntree=100
Error in resample.fun(learner2, task, resampling, measures = measures, :
Size of data set: 227 and resampling instance: 228 differ!
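The error message suggests a size mismatch: resample() first reduces the task to the 227 training rows and then passes that subset to the tune wrapper, whose inner resampling f was built for 228 rows. Below is a minimal sketch of one possible way around this, assuming the tuning is run directly with tuneParams on the full task so that f is applied only once. The 30-row test set and the names n, train.idx, test.idx and grid are illustrative assumptions, not part of the setup above. They are used here only so that cindex has enough comparable pairs; with a single test row a custom measure would probably be needed. The party package is assumed to be installed for surv.cforest.

```r
library(mlr)
library(survival)

# Same data preparation as above, written with base R
myData2 <- survival::lung[, c("time", "status", "age")]
myData2$status <- (myData2$status == 2)

n <- nrow(myData2)
# Assumption: hold out the last 30 rows instead of only the last one,
# so that cindex can be computed on the test set.
test.idx  <- (n - 29):n
train.idx <- setdiff(seq_len(n), test.idx)

surv.task <- makeSurvTask(data = myData2, target = c("time", "status"))
surv.lrn  <- makeLearner("surv.cforest")

ps <- makeParamSet(
  makeDiscreteParam("mincriterion", values = c(1.281552, 2, 3)),
  makeDiscreteParam("ntree", values = c(100, 200, 300))
)

# size must match the number of rows of the task that is resampled
f <- makeFixedHoldoutInstance(train.inds = train.idx,
                              test.inds = test.idx,
                              size = n)

# Grid search evaluated on the fixed split, without an outer resample()
res <- tuneParams(surv.lrn, task = surv.task, resampling = f,
                  par.set = ps, control = makeTuneControlGrid(),
                  measures = cindex)

# One row per hyperparameter combination, with its test-set measure
grid <- as.data.frame(res$opt.path)
print(grid)
```

If this works as intended, grid holds one row for each of the nine mincriterion/ntree combinations, which can then be compared side by side.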