Clustering using Java-ML package

Question

I have a dataset in which each instance has a single attribute value, and I need to cluster it. Java-ML (Java Machine Learning Library) seemed suitable for this task. However, its "Dataset" class is structured as a set of instances, each of which is a set of attributes plus a class label. My problem is that each of my instances has only a single attribute and no class label.

Here is the sample code I tried. Unexpectedly, execution never finishes once clustering starts.

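    import net.sf.javaml.clustering.Clusterer;
    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.core.DefaultDataset;
    import net.sf.javaml.core.Instance;
    import net.sf.javaml.core.SparseInstance;

    // Build a dataset of one-attribute instances.
    Dataset dataset = new DefaultDataset();
    double[] val = {5, 6, 15, 20, 40, 50, 55, 73};
    for (int i = 0; i < val.length; i++) {
        Instance instance = new SparseInstance(1);
        instance.put(1, val[i]);
        dataset.add(instance);
    }

    // Cluster into k groups; the call to cluster() never returns.
    int k = 3;
    Clusterer km = new KMeans(k);
    System.out.println(dataset);
    Dataset[] clusters = km.cluster(dataset);
    System.out.println(dataset);
    for (int i = 0; i < k; i++) {
        System.out.println(clusters[i] + "\n\n\n\n");
    }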

I cannot work out the reason for this behavior. Is there another library better suited to this task than Java-ML?

Thanks in advance.

score 2 · Accepted Answer · answered Jun 30 '13 at 09:36


First of all, as your data is 1-dimensional, don't use clustering in the first place.

1-dimensional data can be sorted, which allows for much faster algorithms than in the general case. You may want to look into classic statistics, natural breaks, kernel density estimation, etc. In fact, I'd start with kernel density estimation and split the data at the lowest minimum between two local maxima.
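To make the kernel density idea concrete, here is a minimal sketch in plain Java, not tied to any library. It is only one way to do it and rests on assumptions: a Gaussian kernel, a hand-picked bandwidth `h`, and an evenly spaced evaluation grid. It also cuts at every interior local minimum of the estimated density rather than only the lowest one. The names `KdeSplit`, `density`, `findCuts` and `split` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KdeSplit {
    // Gaussian kernel density estimate at x. The bandwidth h is an assumed,
    // hand-picked value; the result is sensitive to it.
    static double density(double[] data, double x, double h) {
        double sum = 0.0;
        for (double d : data) {
            double u = (x - d) / h;
            sum += Math.exp(-0.5 * u * u);
        }
        return sum / (data.length * h * Math.sqrt(2 * Math.PI));
    }

    // Evaluates the density on an evenly spaced grid over [min, max] and
    // returns the grid positions that are interior local minima.
    static List<Double> findCuts(double[] data, double h, int gridSize) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double lo = sorted[0];
        double hi = sorted[sorted.length - 1];
        double step = (hi - lo) / (gridSize - 1);
        double[] dens = new double[gridSize];
        for (int i = 0; i < gridSize; i++) {
            dens[i] = density(sorted, lo + i * step, h);
        }
        List<Double> cuts = new ArrayList<>();
        for (int i = 1; i < gridSize - 1; i++) {
            if (dens[i] < dens[i - 1] && dens[i] <= dens[i + 1]) {
                cuts.add(lo + i * step);
            }
        }
        return cuts;
    }

    // Splits the sorted values into groups separated by the (ascending) cut points.
    static List<List<Double>> split(double[] data, List<Double> cuts) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        List<List<Double>> groups = new ArrayList<>();
        groups.add(new ArrayList<>());
        int c = 0;
        for (double v : sorted) {
            while (c < cuts.size() && v > cuts.get(c)) {
                groups.add(new ArrayList<>());
                c++;
            }
            groups.get(groups.size() - 1).add(v);
        }
        return groups;
    }

    public static void main(String[] args) {
        double[] val = {5, 6, 15, 20, 40, 50, 55, 73}; // values from the question
        List<Double> cuts = findCuts(val, 5.0, 200);   // bandwidth 5.0 is a guess
        System.out.println("cuts:   " + cuts);
        System.out.println("groups: " + split(val, cuts));
    }
}
```

The number of groups falls out of the density shape instead of being fixed up front, but it depends heavily on the bandwidth: a smaller `h` gives more, smaller groups, and a larger `h` merges them.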

Now for Java-ML, what you say indicates that it is actually a classification package. The need for class labels is typical for applications designed with classification in mind. There it is essential to have a class label for learning and validation.

I've mostly used ELKI, which has a huge choice of clustering algorithms and does not expect the data to be labeled.

Has QUIT--Anony-Mousse


What I want is to group the 1-dimensional data into subsets that are homogeneous within and heterogeneous with respect to each other. The number of subsets isn't fixed and has to be determined from the data. Following your suggestion of natural breaks and kernel density estimation, I looked into them, and Jenks natural breaks seems to suit this task. The problem with this method is the heavy iterative computation needed to identify the breaks. However, k-means also suffers from the problem of having to fix the value of k. – Tarique Jul 01 '13 at 18:57
Please have a look at this link (http://stackoverflow.com/questions/5304057/partition-into-classes-jenks-vs-kmeans), where the person gets results much faster using k-means than natural breaks. Could you comment on how natural breaks and k-means compare? – Tarique Jul 01 '13 at 18:57
I don't use 1-d data, so I don't have first-hand experience. Either way, k-means doesn't make much sense in 1-d, because it doesn't exploit the order of the data. As for the observed effect, it may well be an implementation issue. K-means in R is a C function, so it is reasonably fast. Maybe the other one is pure R, which can be substantially slower. – Has QUIT--Anony-Mousse Jul 01 '13 at 21:22

score 1 · Answer 2 · answered Jun 30 '13 at 00:16


If all you have is one feature value, there is very little reason to use a clustering algorithm. Just plotting a histogram or KDE should be more than sufficient to find the information you are looking for.
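As a rough illustration of the histogram route, here is a minimal sketch in plain Java that counts values into equal-width bins and prints a text bar per bin. The class name `TextHistogram`, the method `histogram` and the bin count of 4 are assumptions picked for the example, not anything from Java-ML.

```java
import java.util.Arrays;

class TextHistogram {
    // Counts values into binCount equal-width bins spanning [min, max].
    // binCount is an assumed, hand-picked parameter.
    static int[] histogram(double[] data, int binCount) {
        double min = Arrays.stream(data).min().getAsDouble();
        double max = Arrays.stream(data).max().getAsDouble();
        double width = (max - min) / binCount;
        int[] counts = new int[binCount];
        for (double v : data) {
            int bin = (width == 0) ? 0 : (int) ((v - min) / width);
            if (bin >= binCount) {
                bin = binCount - 1; // the maximum value goes into the last bin
            }
            counts[bin]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] val = {5, 6, 15, 20, 40, 50, 55, 73}; // values from the question
        int[] counts = histogram(val, 4);
        for (int i = 0; i < counts.length; i++) {
            StringBuilder bar = new StringBuilder();
            for (int j = 0; j < counts[i]; j++) {
                bar.append('#');
            }
            System.out.println("bin " + i + ": " + bar);
        }
    }
}
```

Empty or sparse bins between fuller ones suggest where the data could be split; the picture changes with the bin count, so it is worth trying a few values.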

Raff.Edward


Thanks for the reply @Edward. What I want is to group the 1-dimensional data into subsets that are homogeneous within and heterogeneous with respect to each other. The number of subsets isn't fixed and has to be determined from the data. Following your suggestion of natural breaks and kernel density estimation, I looked into them, and Jenks natural breaks seems to suit this task. The problem with this method is the heavy iterative computation needed to identify the breaks. However, k-means also suffers from the problem of having to fix the value of k. – Tarique Jul 01 '13 at 19:06
Please have a look at this link (stackoverflow.com/questions/5304057/…), where the person gets results much faster using k-means than natural breaks. Could you comment on how natural breaks and k-means compare? – Tarique Jul 01 '13 at 19:07
It seems you need to do your own research and study. Instead of stating your goal, you asked how to use a tool to accomplish it. It was clear you shouldn't be using that tool. Now that you have a concise and clear goal, you need to try to solve your problem or search for existing solutions. If there are none, you must solve the problem yourself. If you cannot solve it, you must determine whether you lack the required knowledge or whether the problem is not solvable. If the former, you may choose to work harder to gain that knowledge. Otherwise, you may need to purchase outside help. – Raff.Edward Jul 02 '13 at 04:13
