
How can I fix a PipeOp's `$state`, so that its parameters are set from the beginning and remain the same during both training and prediction?

task = tsk("iris")
pos1 = po("scale", param_vals = list(
  center = TRUE,
  scale = TRUE,
  affect_columns = selector_name("Sepal.Width")))

pos1$state
pos1$state$center <- c(Sepal.Width = 0)
pos1$state$scale <- c(Sepal.Width = 2)
 
graph <- pos1 %>>% lrn("classif.xgboost", eval_metric = "mlogloss")
gl <- GraphLearner$new(graph)
gl$train(task)
gl$state

In the code above, the parameters `center` and `scale` of `po("scale")` are recalculated from the data even though I try to fix them at zero and two, respectively (I am not sure whether I did this correctly).

Nip

1 Answer


A PipeOp's `$state` should never be changed manually. It is more of a logging slot: a place for you to inspect what was learned during training, and where the PipeOp finds all the information it needs to carry out its prediction step after being trained.

PipeOpScale will always center the training data to mean 0 and scale it by its root-mean-square (see ?scale), and it stores the "learned" parameters (the mean and root-mean-square of the training data, i.e., the attributes returned by the scale function) in its `$state`. During prediction, the data is transformed using these stored values, so the prediction data will typically end up with a somewhat different mean and root-mean-square.
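To illustrate, here is a small sketch (using the iris task) of how the `$state` is filled during training and then reused during prediction; the exact printed values depend on the data:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
pos_demo = po("scale", affect_columns = selector_name("Sepal.Width"))

# training computes and stores the scaling parameters
pos_demo$train(list(task))
pos_demo$state$center  # mean of Sepal.Width in the training data
pos_demo$state$scale   # root-mean-square after centering

# prediction reuses the stored center/scale; it does not recompute them
predict_out = pos_demo$predict(list(task))[[1]]
```

If the prediction data were different from the training data, its transformed mean would not be exactly 0, because the training mean is subtracted, not the prediction mean.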

Assuming you want to scale "Sepal.Width" to mean 0 and root-mean-square 2 both during training and prediction (as suggested by your code above; but this may be a bad idea), you can use PipeOpColApply:

f = function(x) {
  # scale() centers to mean 0 and divides by the root-mean-square;
  # multiplying by 2 gives root-mean-square 2, and the "+ 0" marks
  # where a target mean would go (e.g., "+ 10" for mean 10)
  scale(x)[, 1] * 2 + 0
}

task = tsk("iris")
pos = po("colapply", applicator = f, affect_columns = selector_name("Sepal.Width"))

train_out = pos$train(list(task))[[1]]$data(cols = task$feature_names)
round(colMeans(train_out), 2)
round(apply(train_out, MARGIN = 2, FUN = sd), 2)

pos$state
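Because PipeOpColApply simply reapplies `f`, the prediction data is rescaled in the same way. A sketch of the prediction step (note that this recomputes the mean and root-mean-square on the prediction data itself, which is exactly the kind of information leakage mentioned above):

```r
predict_out = pos$predict(list(task))[[1]]$data(cols = task$feature_names)
round(colMeans(predict_out), 2)                    # Sepal.Width centered to 0
round(apply(predict_out, MARGIN = 2, FUN = sd), 2) # Sepal.Width scaled to 2
```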
sumny
  • I am curious why the `+ 0`? – missuse Oct 30 '20 at 20:09
  • This seems to work for this specific `po`, but what about others. – Nip Oct 30 '20 at 20:21
  • Just added the `+ 0` to indicate where the outcome mean specification would go and you could e.g., do `scale(x)[, 1] * 2 + 10` to get an outcome with mean 10 and a root-mean-square of 2. – sumny Nov 01 '20 at 10:53
  • Regarding other `PipeOp`s, as I said, the `$state` of a `PipeOp` cannot be fixed, and this is typically meaningful, as the workflow of ML pipelines in general is the following: perform the operation on the training data and store the learned information in the `$state`. Then, during prediction, rely on the `$state` that was learned on the training data (this prevents information leakage) to carry out the operation on the prediction data. If you have other very specific scenarios in mind, you can always write your own `PipeOp`s that e.g. inherit from `PipeOpTaskPreproc`/`PipeOpTaskPreprocSimple`. – sumny Nov 01 '20 at 10:58
  • I didn't know about creating a custom `PO`. I have to check that out. Also, seems like fixing the `$state` of a `PO` hasn't been implemented yet (https://github.com/mlr-org/mlr3pipelines/issues/537) – Nip Nov 05 '20 at 04:25