
I am trying to detect dense subspaces in a high-dimensional dataset, and I want to use the ELKI library for this. However, there is very little documentation and there are few examples for ELKI.

I tried the following:

    Database db=makeSimpleDatabase("D:/sample.csv", 600);

    ListParameterization params = new ListParameterization();
    params.addParameter(CLIQUE.TAU_ID, "0.1");
    params.addParameter(CLIQUE.XSI_ID, 20);

    // setup algorithm
    CLIQUE<DoubleVector> clique = ClassGenericsUtil.parameterizeOrAbort(CLIQUE.class, params);

    // run CLIQUE on database
    Clustering<SubspaceModel<DoubleVector>> result = clique.run(db);

    for(Cluster<?> cl : result.getToplevelClusters()) {
        System.out.println(cl.getIDs());
    }

I gave the following input:

    2,2
    2,3
    5,2
    5,3
    8,4

and the result was:

    [2, 1]
    [4, 3]
    [5]
    [3, 1]
    [4, 2]
    [5]
    [1]
    [2]
    [3]
    [4]
    [5]

I expected the output to be the input data points grouped into subspaces. Maybe I am picking the wrong values or setting the parameters in the wrong way.

Please help. Thanks in advance.

  • "didnt worked" is not very precise. What happened? What did you expect to happen? Does it work in the MiniGUI? I use mostly the MiniGUI. – Has QUIT--Anony-Mousse May 05 '15 at 23:37
  • Thanks for the reply & sorry for imprecise question. I have edited the question. – Shantanu Mahakale May 06 '15 at 15:36
  • I never got really good results with CLIQUE. I think it only works for synthetic data. Also, it probably only works for continuous data, high-dimensional data, and larger data sets. It's really based on the concept that some dimensions are uniform noise if I recall correctly. But I don't think it is a method worth trying - it really never worked on my data sets. – Has QUIT--Anony-Mousse May 06 '15 at 18:48

1 Answer

Note that CLIQUE produces overlapping clusters.

An element can belong to zero, one, or many clusters at the same time. If you choose your parameters badly (and CLIQUE parameters seem to be really hard to choose), you will get weird results. In your case, you got 11 clusters even though your data set has only 5 elements.

Essentially what the clustering tells you is:

Elements [2,1] cluster (they both have x=2)

Elements [4,3] cluster (they both have x=5)

Element [5] is a cluster (only element with x=8)

Elements [3,1] cluster (they both have y=2)

Elements [4,2] cluster (they both have y=3)

Element [5] is a cluster (only element with y=4)

In the x,y subspace, every element is separate and forms its own cluster (the last five lines of your output).
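The reading above can be reproduced with a small plain-Java sketch that does not use ELKI at all. It simply groups the five input points by their x value, by their y value, and by the (x, y) pair. This is only an illustration of where the 11 groups come from, not CLIQUE itself; the class and method names (`SubspaceGroupingSketch`, `groupBy`, `totalGroups`) are made up for this example, and the IDs are assumed to be 1-based in input order, as in the output shown in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubspaceGroupingSketch {
    // The five input points from the question, in input order (IDs 1..5).
    static final double[][] POINTS = { {2, 2}, {2, 3}, {5, 2}, {5, 3}, {8, 4} };

    /** Groups 1-based point IDs by their values in the given dimensions (0 = x, 1 = y). */
    static Map<String, List<Integer>> groupBy(int... dims) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < POINTS.length; i++) {
            StringBuilder key = new StringBuilder();
            for (int d : dims) {
                key.append(POINTS[i][d]).append(';');
            }
            groups.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(i + 1);
        }
        return groups;
    }

    /** Total number of groups over the subspaces {x}, {y} and {x, y}. */
    static int totalGroups() {
        return groupBy(0).size() + groupBy(1).size() + groupBy(0, 1).size();
    }

    public static void main(String[] args) {
        System.out.println("x only: " + groupBy(0).values());
        System.out.println("y only: " + groupBy(1).values());
        System.out.println("x and y: " + groupBy(0, 1).values());
        System.out.println("total groups: " + totalGroups());
    }
}
```

Three groups in x, three in y, and five singletons in the x,y subspace add up to the 11 clusters in the question. Grouping by exact value matches the result here only because, with 20 intervals per dimension, each distinct coordinate value ends up in its own interval.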

Choose better parameters for this fragile algorithm.

With TAU = 0.1, the density threshold is 10% of 5 points, so any grid cell containing more than 0.5 points counts as dense. In other words, every non-empty cell is a cluster. That is why you get this result: you asked for it.
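To see the effect of TAU on this data set, here is a minimal plain-Java sketch of the density test in one dimension. It is not ELKI's implementation. It assumes that each dimension is split into XSI equal-width intervals between the observed minimum and maximum, and that an interval is dense when its share of all points is strictly greater than TAU. The names `DenseUnitSketch`, `interval` and `denseUnits` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class DenseUnitSketch {
    // The five input points from the question.
    static final double[][] POINTS = { {2, 2}, {2, 3}, {5, 2}, {5, 3}, {8, 4} };

    /** Index of the equal-width interval containing v, with xsi intervals over [min, max]. */
    static int interval(double v, double min, double max, int xsi) {
        int idx = (int) ((v - min) / (max - min) * xsi);
        return Math.min(idx, xsi - 1); // the maximum value falls into the last interval
    }

    /** Counts the one-dimensional intervals in dimension dim whose share of points exceeds tau. */
    static int denseUnits(int dim, int xsi, double tau) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double[] p : POINTS) {
            min = Math.min(min, p[dim]);
            max = Math.max(max, p[dim]);
        }
        // Count how many points fall into each interval.
        Map<Integer, Integer> counts = new HashMap<>();
        for (double[] p : POINTS) {
            counts.merge(interval(p[dim], min, max, xsi), 1, Integer::sum);
        }
        // Assumed density test: fraction of points in the interval must exceed tau.
        int dense = 0;
        for (int c : counts.values()) {
            if ((double) c / POINTS.length > tau) {
                dense++;
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        for (double tau : new double[] { 0.1, 0.3, 0.5 }) {
            System.out.println("tau=" + tau
                + " dense x intervals: " + denseUnits(0, 20, tau)
                + ", dense y intervals: " + denseUnits(1, 20, tau));
        }
    }
}
```

Under these assumptions, TAU = 0.1 marks all three non-empty intervals in each dimension as dense, including the one that holds only point 5. TAU = 0.3 keeps only the two intervals with two points each, and TAU = 0.5 leaves nothing, since no interval holds more than half of the points.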

Erich Schubert