
In the mlr package, I can perform clustering. Let's say I don't want to know how the model performs on unseen data; I just want to know what the best number of clusters is with respect to a given performance measure.

In this example, I use the moons data set from the dbscan package.

library(mlr)
library(dbscan)
data("moons")

db_task = makeClusterTask(data = moons)

db = makeLearner("cluster.dbscan")

ps = makeParamSet(makeDiscreteParam("eps", values = seq(0.1, 1, by = 0.1)),
  makeIntegerParam("MinPts", lower = 1, upper = 5))

ctrl = makeTuneControlGrid()

rdesc = makeResampleDesc("CV", iters = 3) # I don't want to use it, but I have to 

res = tuneParams(db, 
  task = db_task, 
  control = ctrl,
  measures = silhouette, 
  resampling = rdesc, 
  par.set = ps)
#> [Tune] Started tuning learner cluster.dbscan for parameter set:
#>            Type len Def                                Constr Req Tunable
#> eps    discrete   -   - 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1   -    TRUE
#> MinPts  integer   -   -                                1 to 5   -    TRUE
#>        Trafo
#> eps        -
#> MinPts     -
#> With control class: TuneControlGrid
#> Imputation value: Inf
#> [Tune-x] 1: eps=0.1; MinPts=1
#> Error in matrix(nrow = k, ncol = ncol(x)): invalid 'nrow' value (too large or NA)

Created on 2019-06-06 by the reprex package (v0.3.0)

However, mlr forces me to use a resampling strategy. Any idea how to use mlr for cluster tasks without resampling?
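What I have in mind is a manual grid search on the full data, roughly like the sketch below (this assumes that mlr's train(), predict() and performance() accept a cluster task together with the silhouette measure in the way I expect), so that tuneParams() and its resampling argument are not needed at all:

grid = expand.grid(eps = seq(0.1, 1, by = 0.1), MinPts = 1:5)

grid$sil = mapply(function(eps, minpts) {
  lrn = setHyperPars(makeLearner("cluster.dbscan"),
                     eps = eps, MinPts = as.integer(minpts))
  mod = train(lrn, db_task)
  pred = predict(mod, task = db_task)    # predict on the training data itself
  # the silhouette is undefined if fewer than two clusters are found,
  # so fall back to NA instead of stopping the whole grid search
  tryCatch(performance(pred, measures = silhouette, task = db_task),
           error = function(e) NA)
}, grid$eps, grid$MinPts)

grid[which.max(grid$sil), ]              # best eps/MinPts on the full data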

Banjo
  • Your code does not run for me (see the inserted reprex above). Why don't you take a look at the number of clusters calculated by the best performing model during tuning? – pat-s Jun 06 '19 at 14:20
  • I don't understand why it doesn't work (I leave it in there until I understand the reason). I had some data sets where the results from the CV and the silhouette plot were different. – Banjo Jun 06 '19 at 15:36

1 Answer


mlr is pretty poor when it comes to clustering. Its dbscan function is a wrapper around the very slow fpc package; other learners wrap Weka, which is also very slow.

Use the dbscan package instead.

However, parameter tuning does not simply carry over to unsupervised settings. You don't have labels, so all you have are unreliable "internal" heuristics instead. Most of these are not reliable for DBSCAN because they assume that noise is a cluster, but it isn't. Few tools support noise in evaluation (I've seen options for this in ELKI), and I'm not convinced that either of the common variants for handling noise is good; you can construct undesirable cases for each of them, IMHO. You probably need at least two measures when evaluating a clustering that contains noise.
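To illustrate the two variants (a rough sketch; eps = 0.4 and minPts = 5 are arbitrary values for the moons data, and the silhouette is only defined if at least two clusters remain): you can score a dbscan::dbscan() result once with the noise points treated as an extra "cluster" and once with the noise points dropped, and the two numbers will generally disagree.

library(dbscan)
library(cluster)                     # for silhouette()
data("moons")

cl = dbscan::dbscan(moons, eps = 0.4, minPts = 5)$cluster   # 0 marks noise
d  = dist(moons)

# Variant 1: treat all noise points as one extra "cluster"
sil_with_noise = mean(silhouette(cl, d)[, "sil_width"])

# Variant 2: drop the noise points before scoring
keep = cl != 0
sil_core_only = mean(silhouette(cl[keep], dist(moons[keep, ]))[, "sil_width"])

c(with_noise = sil_with_noise, core_only = sil_core_only)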

Has QUIT--Anony-Mousse
  • I tried it with the `dbscan` package already. The problem is that I have to set the number of clusters beforehand and then make the kNN plot. But the clusters are not clearly visible (e.g. in a PCA) in my case, so tuning based on a metric seemed to be one solution. – Banjo Jun 07 '19 at 08:56
  • 1
    If you want to predefine the number of clusters, k-means seems more appropriate as it enforces such a structure. Nevertheless, consider OPTICS and HDBSCAN* then instead of a grid search on DBSCAN. That is why OPTICS was invented in the first place, to not need to choose epsilon. – Has QUIT--Anony-Mousse Jun 07 '19 at 16:52
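For completeness, here is a minimal sketch of the HDBSCAN* route suggested in the last comment (minPts = 5 is just a placeholder, not a recommendation): dbscan::hdbscan() only needs minPts and extracts the clusters from the density hierarchy itself, so there is no eps grid to search.

library(dbscan)
data("moons")

hc = hdbscan(moons, minPts = 5)   # HDBSCAN*: no eps parameter needed
hc$cluster                        # cluster assignment per point, 0 = noise
plot(hc)                          # condensed cluster hierarchy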