4

We're using the EuclideanDistanceSimilarity class to calculate the similarity of a bunch of items using Hadoop.

Unfortunately some items are getting zero or very few resulting similar items despite being highly similar to items.

I think I've tracked it down to this line in the EuclideanDistanceSimilarity class:

double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);

The value passed to sqrt is sometimes negative, in which case NaN is returned. I figure perhaps there should be a Math.abs in there somewhere but my maths aren't strong enough to understand how the Euclidean calculation has been rearranged so not sure what the effect would be.

Can anyone explain the maths any better and confirm whether

double euclideanDistance = Math.sqrt(Math.abs(normA - 2 * dots + normB));

would be an acceptable fix?

Tom Martin
  • 2,498
  • 3
  • 29
  • 37
  • Interesting implementation, normally euclidian similarities are like `(1-sqrt(sum((vector2-vector1)^2))` – Thomas Jungblut Oct 26 '12 at 15:06
  • Where in the code does that line occur? – Fred Foo Oct 26 '12 at 15:17
  • @ThomasJungblut Yeah, I don't totally understand, I guess it's because the calculation's been distributed. I can tell you that in this case the normA and normB are the sum of the squares of each vector whilst the dots is possibly the sum dot product but I'm unsure. – Tom Martin Oct 26 '12 at 15:32
  • @larsmans In the SimiliarityReducer it calls `similarity.similarity`. That line is in EuclideanDistanceSimilarity which is one of the VectorSimilarityMeasures that can be plugged in via the configuration of the Hadoop job. – Tom Martin Oct 26 '12 at 15:35

1 Answers1

5

The code is in org.apache.mahout.math.hadoop.similarity.cooccurrence.measures. EuclideanDistanceSimilarity.

Yes it's written in this way because at that point in the computation it has the norms of vectors A and B, and their dot product, so it's much faster to compute the distance that way.

The identity is pretty easy. Let C = A - B and let a, b and c be the lengths of the corresponding vectors. We need c. From the law of cosines, c2 = a2 + b2 - 2ab·cos(θ), and ab·cos(θ) is just the value of the dot product. Note that normA in the code is actually the square of the norm (length) -- really should have been better named.

Back to the question: you are right there's a bug here, in that rounding can make the argument negative. The fix isn't abs(), but:

double euclideanDistance = Math.sqrt(Math.max(0.0, normA - 2 * dots + normB));

It just needs to be capped to 0. I can commit that.

Sean Owen
  • 66,182
  • 23
  • 141
  • 173
  • FYI I tried it with a patched version of the EuclideanDistanceSimilarity and it seemed to do the trick. Thanks Sean. – Tom Martin Oct 30 '12 at 17:22