I'm trying to cluster image data (stored in 100 separate CSV files) with ELKI's XMeans algorithm. It works well for the first two files, but then the algorithm hangs forever while processing the third file. The problem seems to occur at roughly every third file: when I start the loop over all files at the fourth file, it works for the fourth and the fifth file, but not for the sixth. The same goes for the 9th and 11th files, but maybe that's a coincidence.

My XMeans call looks like this:

    DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
    Database db = new StaticArrayDatabase(dbc, null);
    db.initialize();

    Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
    DBIDRange ids = (DBIDRange) rel.getDBIDs();

    SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;

    RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);

    KMeansInitialization initializer = new FirstKInitialMeans();

    PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data);
    KMeansQualityMeasure informationCriterion = new WithinClusterMeanDistanceQualityMeasure();
    RandomFactory random = new RandomFactory(123);
    KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, init, true);

    XMeans<NumberVector, KMeansModel> xm = new XMeans<>(dist, 5, 50, 1, innerKMeans, initializer, splitInitializer, informationCriterion, random);

    Clustering<KMeansModel> c = xm.run(db, rel);

I'm not too sure about these four lines, so maybe that's why it works for some files and for others it doesn't:

    KMeansInitialization initializer = new FirstKInitialMeans();

    PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data);
    KMeansQualityMeasure informationCriterion = new WithinClusterMeanDistanceQualityMeasure();
    RandomFactory random = new RandomFactory(123);

`data` is just a `double[][]` that contains the data from the input files.

Any help would be much appreciated!

Charlie28000

1 Answer

Please use the Parameterization API to configure X-means.

Because of the nested k-means, it is very easy to configure things badly.

The initializer of the inner k-means class must be set to this:

    PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans((double[][]) null);

    KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, splitInitializer, true);

because otherwise X-means currently cannot control the initialization of the inner algorithm. I will remove this parameter, and have XMeans set the initializer of the inner algorithm.
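A note on the `(double[][]) null` cast: in plain Java, a bare `null` argument does not compile when a class has several single-argument constructors taking unrelated reference types, because the compiler cannot choose between them. The sketch below shows this with a hypothetical stand-in class `Means`; it is not ELKI code, and that `PredefinedInitialMeans` has such overloads is an assumption here.

```java
import java.util.List;

public class NullOverloadDemo {
    // Hypothetical stand-in (NOT the ELKI class): two single-argument
    // constructors that both accept null.
    static class Means {
        final String chosen;

        Means(double[][] means) {
            chosen = "double[][]";
        }

        Means(List<double[]> means) {
            chosen = "List";
        }
    }

    static String pick() {
        // new Means(null) would be rejected by the compiler as ambiguous;
        // the cast selects the double[][] overload explicitly.
        return new Means((double[][]) null).chosen;
    }

    public static void main(String[] args) {
        System.out.println(pick());
    }
}
```

The cast only selects the overload. Whether the constructor then accepts a `null` value is up to the library version in use.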

Without a stack trace (as mentioned by @Anony-Mousse) it is hard to say what is happening. My best guess is that this meta-algorithm (an algorithm that runs another algorithm!) is not configured correctly and perhaps chooses bad initial values.
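One way to get a stack trace from a run that hangs is to execute it on a separate thread and, if it has not finished after a timeout, record where that thread currently is. The following is a minimal sketch using only the JDK; `HangDiagnostics`, `stackIfStuck`, and `spinForever` are hypothetical names, and `spinForever` merely stands in for the clustering call.

```java
public class HangDiagnostics {
    /**
     * Runs the task on a daemon thread. Returns null if it finishes within
     * the timeout; otherwise returns the stack trace of the stuck thread.
     */
    static String stackIfStuck(Runnable task, long timeoutMillis) throws InterruptedException {
        Thread worker = new Thread(task, "clustering-worker");
        worker.setDaemon(true); // do not keep the JVM alive if the task never ends
        worker.start();
        worker.join(timeoutMillis);
        if (!worker.isAlive()) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (StackTraceElement frame : worker.getStackTrace()) {
            sb.append("  at ").append(frame).append('\n');
        }
        worker.interrupt(); // only helps if the task checks the interrupt flag
        return sb.toString();
    }

    // Stand-in for a computation that never terminates on its own.
    static void spinForever() {
        while (!Thread.currentThread().isInterrupted()) {
            // busy wait
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.print(stackIfStuck(HangDiagnostics::spinForever, 200));
    }
}
```

In the question's loop, the `Runnable` would wrap the `xm.run(db, rel)` call. The same information is available without code changes by running the JDK's `jstack <pid>` against the hanging process.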

Erich Schubert
  • Thank you very much for your answer. I changed the code accordingly, but now I'm getting a NullPointerException for the splitInitializer line: `Exception in thread "main" java.lang.NullPointerException at de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.initialization.PredefinedInitialMeans.setInitialMeans(PredefinedInitialMeans.java:109) at de.lmu.ifi.dbs.elki.algorithm.clustering.kmeans.initialization.PredefinedInitialMeans.<init>(PredefinedInitialMeans.java:72)` – Charlie28000 Jun 28 '16 at 11:16
  • Oh, it seems to work now. I have changed the code to this: `PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data); KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, splitInitializer, true);` I hope this is okay. – Charlie28000 Jun 28 '16 at 11:23
  • The above code snippet (with the cast and `null`) should work, as that is what `XMeans.Parameterizer` does, at least in current git. Passing `data` does not make much sense here, as the predefined means are not your full data set. You could try `new double[0][0]`, too. – Erich Schubert Jun 28 '16 at 15:32