I am trying to integrate the MiniBatchKmeans function from the ClusterR package into mlr. Following the docs, I have made these changes:

  1. Created makeRLearner.cluster.MiniBatchKmeans
  2. Created trainLearner.cluster.MiniBatchKmeans
  3. Created predictLearner.cluster.MiniBatchKmeans
  4. Registered the above S3 methods (as described here)

At this point, I am able to create the learner and call train and predict on it. However, a problem occurs when I create the learner without providing a value for "clusters".

The underlying function (in ClusterR) does not define a default value for the argument "clusters". Following the mlr approach, I have tried to supply a default value for "clusters" through the par.vals argument. However, this default is ignored.

My code:

#' @export
makeRLearner.cluster.MiniBatchKmeans = function() {
  makeRLearnerCluster(
    cl = "cluster.MiniBatchKmeans",
    package = "ClusterR",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "clusters", lower = 1L),
      makeIntegerLearnerParam(id = "batch_size", default = 10L, lower = 1L),
      makeIntegerLearnerParam(id = "num_init", default = 1L, lower = 1L),
      makeIntegerLearnerParam(id = "max_iters", default = 100L, lower = 1L),
      makeNumericLearnerParam(id = "init_fraction", default = 1, lower = 0),
      makeDiscreteLearnerParam(id = "initializer", default = "kmeans++",
        values = c("optimal_init", "quantile_init", "kmeans++", "random")),
      makeIntegerLearnerParam(id = "early_stop_iter", default = 10L, lower = 1L),
      makeLogicalLearnerParam(id = "verbose", default = FALSE,
        tunable = FALSE),
      makeUntypedLearnerParam(id = "CENTROIDS", default = NULL),
      makeNumericLearnerParam(id = "tol", default = 1e-04, lower = 0),
      makeNumericLearnerParam(id = "tol_optimal_init", default = 0.3, lower = 0),
      makeIntegerLearnerParam(id = "seed", default = 1L)
    ),
    par.vals = list(clusters = 2L),
    properties = c("numerics", "prob"),
    name = "MiniBatchKmeans",
    note = "Note",
    short.name = "MBatchKmeans",
    callees = c("MiniBatchKmeans", "predict_MBatchKMeans")
  )
}

#' @export
trainLearner.cluster.MiniBatchKmeans = function(.learner, .task, .subset, .weights = NULL, ...) {
  ClusterR::MiniBatchKmeans(getTaskData(.task, .subset), ...)
}

#' @export
predictLearner.cluster.MiniBatchKmeans = function(.learner, .model, .newdata, ...) {
  if (.learner$predict.type == "prob") {
    pred = ClusterR::predict_MBatchKMeans(data = .newdata,
      CENTROIDS = .model$learner.model$centroids,
      fuzzy = TRUE, ...)

    res = pred$fuzzy_clusters

    return(res)
  } else {
    pred = ClusterR::predict_MBatchKMeans(data = .newdata,
      CENTROIDS = .model$learner.model$centroids,
      fuzzy = FALSE, ...)

    res = as.integer(pred)

    return(res)
  }
}

The problem (default value of clusters in par.vals above is ignored):

## When defining a value of clusters, it works as expected
lrn <- makeLearner("cluster.MiniBatchKmeans", clusters = 3L)
getLearnerParVals(lrn)
# The below commented lines are printed
# $clusters
# [1] 3

## When not providing a value for clusters, default is not used
lrn <- makeLearner("cluster.MiniBatchKmeans")
getLearnerParVals(lrn)
# The below commented lines are printed
# named list()

Any advice on why I am seeing this behavior? I checked the code of other learners (such as cluster.kmeans and cluster.kkmeans), and they successfully define default values in the same format I have used. Additionally, here is documentation that this is the right way to go.

Here is my code on github, in case it's helpful for reproducing the problem. There is an added test file (in tests/testthat), but that has issues of its own.

Edit 1 - Actual Error Message

Here is the actual error message that I see when trying to train a learner without explicitly providing a value for "clusters":

lrn <- makeLearner("cluster.MiniBatchKmeans")
train(lrn, cluster_task)
 Error in ClusterR::MiniBatchKmeans(getTaskData(.task, .subset), ...) : 
  argument "clusters" is missing, with no default 
10. ClusterR::MiniBatchKmeans(getTaskData(.task, .subset), ...) at RLearner_cluster_MiniBatchKmeans.R#32
9. trainLearner.cluster.MiniBatchKmeans(.learner = structure(list(
    id = "cluster.MiniBatchKmeans", type = "cluster", package = "ClusterR", 
    properties = c("numerics", "prob"), par.set = structure(list(
        pars = list(clusters = structure(list(id = "clusters",  ... at trainLearner.R#24
8. (function (.learner, .task, .subset, .weights = NULL, ...) 
{
    UseMethod("trainLearner")
})(.learner = structure(list(id = "cluster.MiniBatchKmeans",  ... 
7. do.call(trainLearner, pars) at train.R#96
6. fun3(do.call(trainLearner, pars)) at train.R#96
5. fun2(fun3(do.call(trainLearner, pars))) at train.R#96
4. fun1({
    learner.model = fun2(fun3(do.call(trainLearner, pars)))
}) at train.R#96
3. force(expr) at helpers.R#93
2. measureTime(fun1({
    learner.model = fun2(fun3(do.call(trainLearner, pars)))
})) at train.R#96
1. train(lrn, cluster_task) 
prasiddhi

1 Answer

The code in your repository works for me -- are you actually getting an error when you run it? The way that you've encoded the default is really more of an override and not a default. You probably want to do

makeIntegerLearnerParam(id = "clusters", lower = 1L, default = 2L),

and remove the par.vals.

Lars Kotthoff
  • Yes, I get an error when trying to call train on this learner (without clusters argument explicitly provided). I have updated the error traceback in my question details. – prasiddhi Mar 09 '19 at 04:47
  • Works for me without error: `> train(lrn, makeClusterTask(id = "foo", iris[,-5])) Model for learner.id=cluster.MiniBatchKmeans; learner.class=cluster.MiniBatchKmeans Trained on: task.id = foo; obs = 150; features = 4 Hyperparameters: clusters=2` – Lars Kotthoff Mar 09 '19 at 04:50
  • Ok, this is driving me crazy. I took a fresh copy and it worked for me. I tried to make the code change you suggested in your answer and it stopped working (same error). Is it somehow related to the registerS3Method call? – prasiddhi Mar 09 '19 at 10:20
  • It could be, depending on how exactly you're testing it. The safest way is to start a new R session each time. – Lars Kotthoff Mar 09 '19 at 21:32
  • Since the package defines no default for the argument `clusters`, it is definitely correct to put a value in `par.vals`. The default in the parameter description is just "cosmetic". While developing, use `devtools::load_all()` to avoid problems (no guarantee). – jakob-r Mar 11 '19 at 16:13
  • @jakob-r Thank you, I could get it to work. I came across the devtools-related explanation based on http://r-pkgs.had.co.nz/package.html, which I took up because I thought this was related to namespace shenanigans. – prasiddhi Mar 12 '19 at 01:17
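
To illustrate the point made in the comments, here is a minimal base-R sketch of why a value in `par.vals` is needed when the underlying function has no default. It does not use mlr or ClusterR: `fit_sketch` and `train_sketch` are hypothetical stand-ins for the wrapped function and for the train step, on the assumption (taken from the comments above) that only explicitly set values are forwarded to the underlying function, while documented defaults are not.

```r
# Hypothetical stand-in for a function such as ClusterR::MiniBatchKmeans:
# `clusters` has no default in the signature.
fit_sketch <- function(data, clusters, batch_size = 10L) {
  list(n_obs = nrow(data), clusters = clusters, batch_size = batch_size)
}

# Hypothetical stand-in for the train step: only the explicitly set values
# (the par.vals) are forwarded; nothing is filled in from documented defaults.
train_sketch <- function(data, par.vals = list()) {
  do.call(fit_sketch, c(list(data = data), par.vals))
}

# With a value in par.vals, the call succeeds.
with_value <- train_sketch(iris[, -5], par.vals = list(clusters = 2L))

# Without it, the underlying function fails because `clusters` is missing.
without_value <- tryCatch(
  train_sketch(iris[, -5]),
  error = function(e) conditionMessage(e)
)
```

In this sketch `with_value$clusters` is 2, while `without_value` holds the same kind of "missing, with no default" message as in the traceback above. If the real learner shows that error even though `par.vals` is set, a stale definition from an earlier session is a plausible cause, which fits the advice to restart R or reload the package.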