We're using the EuclideanDistanceSimilarity class to calculate the similarity of a bunch of items using Hadoop.
Unfortunately, some items end up with zero or very few similar items in the results, despite being highly similar to other items.
I think I've tracked it down to this line in the EuclideanDistanceSimilarity class:
double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);
The value passed to sqrt is sometimes negative, in which case NaN is returned. I suspect there should be a Math.abs in there somewhere, but my maths isn't strong enough to understand how the Euclidean calculation has been rearranged, so I'm not sure what the effect would be.
Can anyone explain the maths and confirm whether
double euclideanDistance = Math.sqrt(Math.abs(normA - 2 * dots + normB));
would be an acceptable fix?
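
To show what I think is happening, here is a minimal standalone sketch. It is not the Mahout code: the class and method names are made up, and the input values are picked by hand to imitate rounding error. I'm assuming normA and normB are the squared lengths of the two vectors and dots is their dot product, so that the expression is the expansion |a - b|^2 = |a|^2 - 2(a.b) + |b|^2. If that's right, the result can only go negative through floating-point rounding when the two vectors are nearly identical. The sketch compares the original line, the Math.abs variant, and clamping at zero with Math.max, which might be another option.

```java
public class EuclideanClampSketch {

    // Mirrors the line quoted from EuclideanDistanceSimilarity.
    static double naive(double normA, double dots, double normB) {
        return Math.sqrt(normA - 2 * dots + normB);
    }

    // The fix proposed above: take the absolute value before the square root.
    static double withAbs(double normA, double dots, double normB) {
        return Math.sqrt(Math.abs(normA - 2 * dots + normB));
    }

    // Alternative (my assumption, not from Mahout): treat a negative value as zero.
    static double clamped(double normA, double dots, double normB) {
        return Math.sqrt(Math.max(0.0, normA - 2 * dots + normB));
    }

    public static void main(String[] args) {
        // For identical vectors normA, normB and dots should all be equal.
        // Here dots is nudged slightly too high to imitate accumulated rounding error.
        double normA = 1.0;
        double normB = 1.0;
        double dots = 1.0 + 1e-12;

        System.out.println("naive:   " + naive(normA, dots, normB));   // NaN
        System.out.println("withAbs: " + withAbs(normA, dots, normB)); // small positive distance
        System.out.println("clamped: " + clamped(normA, dots, normB)); // 0.0
    }
}
```

In this sketch both variants avoid the NaN. Math.abs gives a tiny non-zero distance, while Math.max gives exactly zero, which seems closer to what I'd expect for identical items.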