Java clustering algorithm to handle both similarity and dissimilarity

Question

I'm working on a Java project where I need to match user queries against several engines. Each engine has a method similarity(Object a, Object b) which returns: +1 if the objects surely match; -1 if the objects surely DON'T match; any float in-between when there's uncertainty.

Example: user searches "Dragon Ball".

Engine 1 returns "Dragon Ball", "Dragon Ball GT", "Dragon Ball Z", and it claims they are DIFFERENT results (similarity=-1), no matter how similar their names look. This engine is accurate, so it has a high "weight" value.
Engine 2 returns 100 different results. Some of them relate to DBZ, others to DBGT, etc. The engine claims they're all "quite similar" (similarity between 0.5 and 1).
The system also queries several other engines (10+).

I'm looking for a way to build clusters out of this system. I need to ensure that values with similarity near -1 will likely end up in different clusters, even if many other values are very similar to all of them.

Is there a well-known clustering algorithm to solve this problem? Is there a Java implementation available? Can I build it on my own, perhaps with the help of a support library? I'm good at Java (15+ years of experience) but I'm completely new to clustering.

Thank you!

Are the answers [here](http://stackoverflow.com/questions/2129269/java-clustering-library) not helpful? — Ironcache, Nov 10 '16 at 17:21
I think your question is too broad ... but let's see what others think. — GhostCat, Nov 10 '16 at 17:24

score 0 · Answer 1 · answered Nov 10 '16 at 21:07

The obvious approach would be to use "1 - similarity" as a distance function, which thus ranges from 0 (sure match) to 2 (sure mismatch). Then, for each pair of objects, add up the distances reported by all engines.

Or you could use 1 + similarity and take the product of these values, so that a single engine reporting -1 forces the product to 0. Many other combinations are possible.

But since you apparently trust the first score more, you may also want to increase its influence. There is no mathematical solution for this; you have to choose the weights depending on your data and preferences. If you have training data, you can optimize the weights for your approach, and you may even want to discard some rankers if they don't work well or are correlated with others.
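
A minimal sketch of the weighted variant of "1 - similarity", assuming a weighted average instead of a plain sum so that the result stays in the range 0 to 2 no matter how many engines have an opinion on a pair. The class name `WeightedDistance`, the method `combine`, and the weights in `main` are hypothetical and chosen only for illustration.

```java
/** Hypothetical helper, not part of any library. */
final class WeightedDistance {

    /**
     * Combines per-engine similarities (each in [-1, +1]) for ONE pair of
     * objects into a single distance in [0, 2].
     * Assumption: weights are non-negative; a higher weight means more trust.
     */
    static double combine(double[] similarities, double[] weights) {
        if (similarities.length != weights.length) {
            throw new IllegalArgumentException("one weight per engine is required");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            double distance = 1.0 - similarities[i]; // 0 = sure match, 2 = sure mismatch
            weightedSum += weights[i] * distance;
            totalWeight += weights[i];
        }
        // Assumption: if no engine has a say, treat the pair as neutral.
        return totalWeight == 0.0 ? 1.0 : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Trusted engine (weight 10) says "surely different",
        // a less trusted one (weight 1) says "quite similar".
        double d = combine(new double[] {-1.0, 0.8}, new double[] {10.0, 1.0});
        System.out.println(d); // close to 2, dominated by the trusted engine
    }
}
```

Whether a weighted average, a plain sum, or a product works best depends on the data, so the weights here should be treated as tuning parameters.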

Has QUIT--Anony-Mousse


I can convert my similarity function however I want, that's not the issue. My problem is: what algorithm/library should I feed my similarities/distances to, so that I can obtain clusters from them? By "clusters" I mean "explicit arrays of related nodes". – agdev84 Nov 11 '16 at 13:19
Library recommendations are off-topic for StackOverflow. You can find some with Google easily, though. Note that they usually expect a pairwise distance matrix, where you only have *one* entry per pair of objects. – Has QUIT--Anony-Mousse Nov 11 '16 at 19:14
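
To illustrate what consuming such a pairwise distance matrix can look like, here is a rough sketch of complete-linkage agglomerative clustering with a distance cutoff. This technique is not named anywhere in the thread; it is picked here only as an assumption, because complete linkage merges two clusters only if *every* cross pair is within the cutoff, so a pair at distance 2 cannot end up together. The class `ThresholdClustering`, the cutoff of 1.0, and the sample matrix are hypothetical, and the naive loops are only suitable for small inputs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not taken from any library. */
final class ThresholdClustering {

    /** Complete-linkage distance: the LARGEST distance between any two members. */
    static double completeLink(double[][] dist, List<Integer> a, List<Integer> b) {
        double max = 0.0;
        for (int i : a) {
            for (int j : b) {
                max = Math.max(max, dist[i][j]);
            }
        }
        return max;
    }

    /** Merges the closest clusters until the best merge would exceed the cutoff. */
    static List<List<Integer>> cluster(double[][] dist, double cutoff) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > 1) {
            int bestA = -1;
            int bestB = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double link = completeLink(dist, clusters.get(a), clusters.get(b));
                    if (link < best) {
                        best = link;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            if (best > cutoff) {
                break; // every remaining merge would join a too-distant pair
            }
            List<Integer> merged = clusters.remove(bestB); // bestB > bestA, so bestA stays valid
            clusters.get(bestA).addAll(merged);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Items 0 and 1 are "surely different" (2.0); item 2 is close to both (0.2).
        double[][] dist = {
            {0.0, 2.0, 0.2},
            {2.0, 0.0, 0.2},
            {0.2, 0.2, 0.0}
        };
        System.out.println(cluster(dist, 1.0)); // prints [[0, 2], [1]]
    }
}
```

In the sample, item 2 joins item 0 only because ties are broken by index order; the point is that items 0 and 1 stay in separate clusters even though a third item is similar to both.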
