
I have a clustering problem that could be summarized this way:

  • I have N particles in a 3D space
  • each particle can interact with a different number of other particles
  • each interaction has a strength
  • I don't know the number of clusters a priori
  • I don't have learning samples (it should be unsupervised)

Output: I'd like to get:

  • the number of clusters
  • a probability for each particle to be part of a cluster (to be able to remove particles that are not clearly assigned)
  • I want to call the clusterer directly from my Java code.

Question:

  • which clusterer would best fit my problem?
  • how should I format my data?
  • should I use the 3D positioning information in addition to the interaction information?
  • how can I get the result for each particle?

I'm very new to Weka, but from what I could find on the Internet:

  • SOM could solve my problem
  • it is a multi-instance problem, but I couldn't find any examples showing how to create relational data. Also, does SOM support relational attributes?

Thanks for your help. jeannot

Has QUIT--Anony-Mousse
jeannot
  • It is part of a bigger analysis problem I have. I'm trying to solve it this way, but I have difficulty finding tutorials or documentation for Java programming with Weka. – jeannot May 06 '12 at 13:37

2 Answers


Weka is very "limited" when it comes to clustering. It has only a few clustering algorithms, and they are quite restricted. I'm not sure you could feed the interaction strength into any of the Weka clustering algorithms.

You might want to have a look at ELKI. It has much more advanced clustering algorithms than Weka, and they are very flexible. For example, you can easily define your own distance function (Tutorial) and use it in any distance-based clustering algorithm.

Choosing the appropriate clustering algorithm is not something we can answer here. You need to try some, with different parameters. The key question you should try to answer first is: what is a useful cluster for you?

You have started to pose some of these questions. For example, whether you want to use interaction strength only, or whether to also include positional information. But as I do not know what you want to achieve, I can't tell you how.
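To make the "interaction only vs. interaction plus position" choice something you can experiment with, here is a minimal sketch of a combined dissimilarity. Everything in it is an assumption for illustration, not something from Weka or ELKI: the class name `ParticleDistance`, the linear mapping from strength to dissimilarity, the normalisation bounds `maxStrength` and `maxSpatial`, and the weight `alpha`.

```java
// Hypothetical helper, not part of any library.
final class ParticleDistance {

    /** Euclidean distance between two 3D positions given as {x, y, z}. */
    static double euclidean(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /**
     * Maps an interaction strength to a dissimilarity in [0, 1].
     * Assumption: stronger interaction means "closer"; no interaction
     * (strength 0) gives the maximum dissimilarity 1.
     */
    static double fromStrength(double strength, double maxStrength) {
        return 1.0 - Math.min(strength, maxStrength) / maxStrength;
    }

    /**
     * Weighted mix of interaction-based and spatial dissimilarity.
     * alpha = 1 uses only the interaction strength, alpha = 0 only the position.
     */
    static double combined(double strength, double maxStrength,
                           double spatial, double maxSpatial, double alpha) {
        double s = fromStrength(strength, maxStrength);
        double p = Math.min(spatial, maxSpatial) / maxSpatial;
        return alpha * s + (1.0 - alpha) * p;
    }
}
```

Running the same clustering with a few values of `alpha` (say 1.0, 0.5 and 0.0) is one way to see whether the positions add anything to the interaction data.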

Definitely have a look at the DBSCAN and OPTICS algorithms (in particular for OPTICS, don't use the one in Weka. It is slow, incomplete and unmaintained!). Maybe start by reading their Wikipedia articles to see whether they make sense for your task. Here is why I believe they are helpful for you:

  • They do not need to know the number of clusters (unlike k-means and EM clustering)
  • They need a "minimum points" parameter, which is essentially a "minimum cluster size"; it controls how fine-grained the result becomes. Increase it to get fewer and larger clusters.
  • They can use arbitrary distance or similarity functions (for example, interaction strength). For DBSCAN you need to set a threshold for what is considered significant; for OPTICS this is not necessary.
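To illustrate the three points above, here is a small, self-contained DBSCAN sketch that works on a precomputed distance matrix, so the distance can come from anything, including interaction strength. This is plain Java written for illustration, not the ELKI or Weka implementation; the class name `MatrixDbscan` is made up, and it assumes `dist` is symmetric with zeros on the diagonal. It is O(n²) and only meant for small data.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration class, not a library API.
final class MatrixDbscan {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    /** Returns a cluster id (0, 1, ...) per point, or NOISE for unassigned points. */
    static int[] cluster(double[][] dist, double eps, int minPts) {
        int n = dist.length;
        int[] label = new int[n];
        Arrays.fill(label, UNVISITED);
        int nextCluster = 0;
        for (int p = 0; p < n; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> seeds = neighbors(dist, p, eps);
            if (seeds.size() < minPts) {      // not a core point (for now: noise)
                label[p] = NOISE;
                continue;
            }
            int c = nextCluster++;            // p is a core point: start a new cluster
            label[p] = c;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (label[q] == NOISE) label[q] = c;   // noise becomes a border point
                if (label[q] != UNVISITED) continue;   // already assigned
                label[q] = c;
                List<Integer> qNeighbors = neighbors(dist, q, eps);
                if (qNeighbors.size() >= minPts) queue.addAll(qNeighbors); // expand from core points only
            }
        }
        return label;
    }

    /** All points within eps of p; includes p itself because dist[p][p] is 0. */
    static List<Integer> neighbors(double[][] dist, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int q = 0; q < dist.length; q++) {
            if (dist[p][q] <= eps) result.add(q);
        }
        return result;
    }
}
```

The number of clusters is simply the largest label plus one, and particles labelled `NOISE` are the ones not clearly assigned. Note that this gives a hard assignment plus noise, not a per-particle probability.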

Next I would probably use the interaction-strength data with OPTICS and try the Xi extraction of clusters, to see whether the resulting clusters make any sense for your use case. (Weka doesn't have the Xi extraction.) Or maybe look at the OPTICS plot first, to see if your similarity and MinPts parameter actually produce the "valleys" you need for OPTICS. DBSCAN is faster, but you need to fix the distance threshold. If your data set is very large, you might want to start with OPTICS on a sample, then decide on a few epsilon values and run DBSCAN on the full data set with these values.
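If you just want a rough starting point for epsilon without running OPTICS, a simpler and different heuristic is the sorted k-distance list suggested in the original DBSCAN paper: compute each point's distance to its k-th nearest neighbor, sort the values, and look for the "knee". The sketch below is plain Java for illustration; the class name `KDistance` is made up, and it assumes a precomputed distance matrix with zeros on the diagonal.

```java
import java.util.Arrays;

// Hypothetical illustration class, not a library API.
final class KDistance {

    /**
     * For every point, the distance to its k-th nearest neighbor
     * (the point itself is not counted), returned in ascending order.
     */
    static double[] sortedKDistances(double[][] dist, int k) {
        int n = dist.length;
        double[] kDist = new double[n];
        for (int i = 0; i < n; i++) {
            double[] row = dist[i].clone();
            Arrays.sort(row);                  // row[0] is the self-distance 0
            kDist[i] = row[Math.min(k, n - 1)];
        }
        Arrays.sort(kDist);
        return kDist;
    }
}
```

Points in dense regions have small k-distances and outliers have large ones, so a sharp jump in the sorted list is a candidate threshold to try as the DBSCAN epsilon.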

Still, start reading here to see if that makes sense for your task:

https://en.wikipedia.org/wiki/DBSCAN#Basic_idea

Has QUIT--Anony-Mousse
  • Thanks for your answer. ELKI looks very interesting, I want to try it. I couldn't find a programming example that calls a clustering algorithm from Java code. Could you provide me a simple one please? In particular, I didn't understand how the input data should be formatted to be given to the algorithm when they are not in a file. Thanks – jeannot May 06 '12 at 20:04
  • To see how to invoke ELKI from Java, have a look at the unit tests included with ELKI. It is probably a bit more complicated than you expected because ELKI has support for index structures for acceleration - so it doesn't just operate on a data matrix. You can however use a `double[][]` matrix as input using the `ArrayAdapterDatabaseConnection` class. – Erich Schubert Jan 01 '13 at 11:47

If you have your data prepared according to the ARFF file format of Weka, then you can use the Cluster tab of the Weka Explorer. This clusters your data (unsupervised) and also gives you the threshold for each feature value for each cluster. Very handy for unsupervised learning.
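For reference, an ARFF file is plain text: a `@relation` line, one `@attribute` line per column, then `@data` followed by comma-separated rows. Here is a minimal sketch that writes particle positions in that layout. The class name `ParticleArff` and the attribute names `x`, `y`, `z` are assumptions for illustration, and the sketch only covers the positions, not the interaction strengths.

```java
// Hypothetical helper, not part of the Weka API.
final class ParticleArff {

    /** Builds ARFF text with one numeric attribute per coordinate and one row per particle. */
    static String toArff(String relation, double[][] positions) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append('\n');
        sb.append("@attribute x numeric\n");
        sb.append("@attribute y numeric\n");
        sb.append("@attribute z numeric\n");
        sb.append("@data\n");
        for (double[] p : positions) {
            // StringBuilder.append(double) is locale-independent, so the
            // decimal separator is always '.', which ARFF needs.
            sb.append(p[0]).append(',').append(p[1]).append(',').append(p[2]).append('\n');
        }
        return sb.toString();
    }
}
```

The resulting string can be saved as a `.arff` file and opened in the Explorer.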

Rushdi Shams
  • Hi, thanks for your answer, but I'd like to call Weka (or any other framework) from Java directly because I want the process to be fully automated. – jeannot May 10 '12 at 04:35
  • You should not have any problem with that. Just see the Weka Java API documentation on how to send data for clustering and how to get the answer. My main point is that clustering can do what you want to do here. You can either use the GUI or the Java API. – Rushdi Shams May 10 '12 at 17:33
  • Yes, it's clustering, but I need a custom distance function to be able to take into account the "interaction strength" between particles (not only their spatial positioning). Anony-Mousse seemed to say it's not possible with Weka. Maybe I'm missing something? – jeannot May 10 '12 at 21:48
  • Hmm, as far as I know this is not doable in Weka – Rushdi Shams May 11 '12 at 03:22