I am having trouble training a model on time series data. For this purpose I decided to use the mlr3 framework, specifically the `mlr3tuning::AutoTuner` class. The whole setup looks like this:
```r
at <- mlr3tuning::AutoTuner$new(
  learner = mlr3::lrn("classif.xgboost"),
  resampling = mlr3::rsmp("RollingWindowCV", window_size = 86400, horizon = 28800,
                          folds = 24, fixed_window = FALSE),
  measure = mlr3::msr("classif.costs", costs = costs),
  search_space = ps,
  terminator = mlr3tuning::trm("clock_time", stop_time = as.POSIXct("2021-08-13 10:00:00")),
  tuner = mlr3tuning::tnr("random_search")
)
```
The error message I am getting looks like this:

```
Error in .__Archive__add_evals(self = self, private = private, super = super, :
  Assertion on 'ydt[, self$cols_y, with = FALSE]' failed: Contains missing values (column 'classif.costs', row 1).
```
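For context, since `costs` is not shown above: it is built roughly like the sketch below (the actual cost values are placeholders here, not my real ones). As far as I understand, `classif.costs` expects a square matrix whose row and column names match the task's class labels exactly.

```r
# Placeholder cost matrix for the three labels (-1, 0, 1); the values are
# made up for illustration. mlr3's "classif.costs" measure requires the
# row and column names to match the task's class labels.
costs <- matrix(c(0, 1, 2,
                  1, 0, 1,
                  2, 1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(response = c("-1", "0", "1"),
                                truth    = c("-1", "0", "1")))
```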
I tried to handle the issue myself, and this is what I have tried so far.

First, the easy route: since the error message pointed at `msr("classif.costs", costs = costs)`, I swapped it for `msr("classif.acc")`. All that did was change the measure name in the error message.

Secondly, I made sure there were no `NA`, `NaN`, `Inf` or `-Inf` values in my training set, but the next attempt yielded the identical error message:

```r
> df <- task$data()
> sapply(df, function(x) sum(is.na(x))) %>% sum
[1] 0
> sapply(df, function(x) sum(is.nan(x))) %>% sum
[1] 0
> sapply(df, function(x) sum(is.infinite(x))) %>% sum
[1] 0
```
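(A more compact way to run the same check in a single pass, guarding against non-numeric columns, would be something like:)

```r
# One-pass check: for numeric columns, !is.finite() covers NA, NaN, Inf
# and -Inf at once; for everything else only NA is meaningful.
bad <- sapply(df, function(x)
  if (is.numeric(x)) sum(!is.finite(x)) else sum(is.na(x)))
bad[bad > 0]  # named counts per offending column; empty when the data is clean
```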
Finally, I came across a similar issue resolved on mlr3's GitHub: "Error on missing values without missing values". The cause there was described as: a very unbalanced dataset made some of the cross-validation resamples miss one or more of the labels. So I started checking whether that also applies to my problem.
Imbalance first: the data is somewhat imbalanced, but frankly I don't think it is skewed enough to produce a resampling group with an incomplete set of labels.

```r
> df[[task$target_names]] %>% table
.
    -1      0      1
133024 413200 123584
```
Resampling itself: looking at the resampling scheme, the highest chance of producing a problematic group lies in one of the test groups, each of which contains only 28800 observations. But let's check all of them. The checks below indicate that every train and test set contains the full set of labels.
Disclaimer: I am aware that these splits involve randomness, but after hundreds of repetitions I was still unable to find a fold without the full set of labels.

```r
> resample$instantiate(task)
> rs <- resample$instance
> sapply(1:24, function(x) df[[task$target_names]][rs$train[[x]]] %>% unique %>% length)
 [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
> sapply(1:24, function(x) df[[task$target_names]][rs$test[[x]]] %>% unique %>% length)
 [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
```
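To make those "hundreds of repetitions" reproducible, I wrapped the check in a small seeded helper along these lines (the function name is my own):

```r
# Re-instantiate the resampling under different seeds and report, per seed,
# the smallest number of distinct labels seen in any test fold; a value
# below 3 would flag an incomplete fold.
check_folds <- function(seed) {
  set.seed(seed)
  resample$instantiate(task)
  rs <- resample$instance
  y  <- task$data()[[task$target_names]]
  min(sapply(seq_len(resample$iters),
             function(i) length(unique(y[rs$test[[i]]]))))
}
table(sapply(1:100, check_folds))  # every run came back 3 for me
```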
My thought process might be faulty, though, and the resampling might still be the reason I cannot train the model. The only problem I have with that assumption is that the error occurs at the x-th evaluation, not the first. So either the problem is elsewhere, or the resampling is rerun in each tuning evaluation until it eventually creates an incomplete group and yields the error. Is that possible?
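One way I could think of to narrow this down is to bypass the tuner entirely and score a single resampling with fixed hyperparameters, so a fold that produces an NA score can be inspected in isolation (a sketch, assuming `task` and the setup from above):

```r
# Run the same learner/resampling combination once, outside the AutoTuner.
rr <- mlr3::resample(
  task,
  mlr3::lrn("classif.xgboost"),
  mlr3::rsmp("RollingWindowCV", window_size = 86400, horizon = 28800,
             folds = 24, fixed_window = FALSE)
)
rr$score(mlr3::msr("classif.acc"))  # per-fold scores; look for NA values
rr$errors                           # fold-level errors recorded during training
```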
I did try to test this with constant hyperparameters on the iris set with randomized labels, but my results were inconclusive. So I am still left asking: what am I doing wrong?
Anyway, thanks for any answers. Cheers!