
I'm working on a Java project where I need to match user queries against several engines. Each engine has a method similarity(Object a, Object b) which returns: +1 if the objects surely match; -1 if the objects surely DON'T match; any float in-between when there's uncertainty.

Example: user searches "Dragon Ball".

  • Engine 1 returns "Dragon Ball", "Dragon Ball GT", "Dragon Ball Z", and it claims they are DIFFERENT results (similarity=-1), no matter how similar their names look. This engine is accurate, so it has a high "weight" value.
  • Engine 2 returns 100 different results. Some of them relate to DBZ, others to DBGT, etc. The engine claims they're all "quite similar" (similarity between 0.5 and 1).
  • The system also queries several other engines (10+).

I'm looking for a way to build clusters out of this system. I need to ensure that values with similarity near -1 will likely end up in different clusters, even if many other values are very similar to all of them.

Is there a well-known clustering algorithm to solve this problem? Is there a Java implementation available? Can I build it on my own, perhaps with the help of a support library? I'm good at Java (15+ years experience) but I'm completely new at clustering.

Thank you!

agdev84

1 Answer


The obvious approach would be to use "1 - similarity" as a distance function, which will thus go from 0 to 2. Then add up the distances from all engines to get a single combined distance for each pair of objects.

Or you could use 1 + similarity and take the product of these values, ... or, or, or, ...

But since you apparently trust the first score more, you may also want to increase its influence. There is no mathematical solution for this; you have to choose the weights depending on your data and preferences. If you have training data, you can optimize the weights for your approach, and you may even want to discard some rankers if they don't work well or are correlated with others.
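As a rough illustration of the weighted "1 - similarity" idea, here is a minimal sketch. The class names `WeightedDistance` and `Engine` and the hand-picked weights are assumptions made for this example, not part of any library. It also divides by the total weight so the result stays in the 0 to 2 range; that normalization is one possible choice, not the only one.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper: combines several engines' similarities into one distance.
final class WeightedDistance {

    // Hypothetical wrapper: one engine's similarity function plus a hand-picked weight.
    static final class Engine {
        final ToDoubleBiFunction<Object, Object> similarity; // returns a value in [-1, +1]
        final double weight;                                 // higher = more trusted

        Engine(ToDoubleBiFunction<Object, Object> similarity, double weight) {
            this.similarity = similarity;
            this.weight = weight;
        }
    }

    // Weighted mean of (1 - similarity) over all engines; the result lies in [0, 2].
    static double distance(List<Engine> engines, Object a, Object b) {
        double sum = 0.0;
        double totalWeight = 0.0;
        for (Engine e : engines) {
            double s = e.similarity.applyAsDouble(a, b);
            sum += e.weight * (1.0 - s);   // similarity +1 -> 0, similarity -1 -> 2
            totalWeight += e.weight;
        }
        if (totalWeight <= 0.0) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        return sum / totalWeight;
    }
}
```

With a trusted engine (weight 3) reporting -1 and a less trusted one (weight 1) reporting 0.5, the combined distance is (3 * 2 + 1 * 0.5) / 4 = 1.625, so the trusted engine's "no match" dominates.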

Has QUIT--Anony-Mousse
  • I can convert my similarity function however I want, that's not the issue. My problem is: what algorithm/library should I feed my similarities/distances to, so that I can obtain clusters from them? With "clusters" I mean "explicit arrays of related nodes". – agdev84 Nov 11 '16 at 13:19
  • Library recommendations are off-topic for StackOverflow. You can find some with Google easily, though. Note that they usually expect a pairwise distance matrix, where you only have *one* entry per pair of objects. – Has QUIT--Anony-Mousse Nov 11 '16 at 19:14
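To make the "one entry per pair" point from the last comment concrete, here is a minimal sketch of building such a matrix. The class name `DistanceMatrix` is hypothetical, and the sketch assumes the per-engine scores have already been merged into a single symmetric distance function; what input format a given clustering library actually expects has to be checked in that library's documentation.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper: fills a symmetric pairwise distance matrix.
final class DistanceMatrix {

    // Computes each pair (i, j) with i < j exactly once and mirrors the value,
    // so there is a single distance per pair of objects. The diagonal stays 0.
    static double[][] build(List<?> items, ToDoubleBiFunction<Object, Object> distance) {
        int n = items.size();
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double v = distance.applyAsDouble(items.get(i), items.get(j));
                d[i][j] = v;
                d[j][i] = v;
            }
        }
        return d;
    }
}
```

The `distance` argument is where the combined value from all engines would go, so the clustering step only ever sees one number per pair.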