If I use a similarity-based measure such as the Pearson correlation score to compare two feature vectors, and I want to know which dimensions (feature fields) are the most dissimilar within the feature set, which algorithm should I use? I am using Mahout, a machine learning library for Java.

seahorse

1 Answer

Well, it would just be the dimension in which the two vectors differed most -- in which the absolute value of the difference of the vectors' values in the dimension was largest. Is that really all you mean or are you looking for something subtler?
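To make the idea concrete, here is a minimal sketch in plain Java, assuming the two feature vectors are available as `double[]` arrays of equal length. The class and method names (`MaxDiffDimension`, `mostDissimilarDimension`) are made up for illustration and are not part of Mahout.

```java
// Hypothetical helper, not a Mahout class: finds the dimension in which
// two equal-length feature vectors differ most in absolute value.
final class MaxDiffDimension {

    /** Returns the index i maximizing |a[i] - b[i]|, or -1 if the vectors are empty. */
    static int mostDissimilarDimension(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        int best = -1;
        double bestDiff = -1.0;
        for (int i = 0; i < a.length; i++) {
            double diff = Math.abs(a[i] - b[i]);
            if (diff > bestDiff) { // ties keep the earliest dimension
                bestDiff = diff;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] fv1 = {1.0, 5.0, 3.0};
        double[] fv2 = {1.5, 1.0, 3.2};
        // Absolute differences are 0.5, 4.0 and 0.2, so dimension 1 stands out.
        System.out.println(mostDissimilarDimension(fv1, fv2));
    }
}
```

If the dimensions are on very different scales, the raw absolute difference will favor the large-scale ones, so some normalization per dimension may be needed first.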

Sean Owen
  • OK, say I have fv1, fv2, fv3, fv4 and fv5 as the feature vectors, which are supposed to be very "similar". Now for feature vector 2 (fv2, say) I need to find which dimensions are awkward, i.e. show much more dissimilarity than the other dimensions. For this I want to compare fv2 with all the other feature vectors and then come up with the answer. So do I need to calculate the average absolute difference across all vectors, or is there some better statistic? – seahorse Mar 13 '12 at 16:23
  • Absolute difference from the average is reasonable; I might suggest something more normalized like a z-value -- just the number of standard deviations the value is from the mean. – Sean Owen Mar 13 '12 at 16:37
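The z-value suggestion can be sketched roughly as follows, again in plain Java with `double[]` arrays rather than Mahout types. This sketch assumes the mean and the (population) standard deviation of each dimension are computed over all the vectors, including the one being examined; the names `DimensionZScores` and `zScores` are hypothetical.

```java
// Hypothetical helper, not a Mahout class: per-dimension z-values of one
// feature vector relative to a group of vectors that includes it.
final class DimensionZScores {

    /**
     * For each dimension d, returns (vectors[target][d] - mean_d) / std_d, where
     * mean_d and std_d are taken over all vectors (population standard deviation).
     * A dimension with zero spread gets a z-value of 0.
     */
    static double[] zScores(double[][] vectors, int target) {
        int n = vectors.length;
        int dims = vectors[target].length;
        double[] z = new double[dims];
        for (int d = 0; d < dims; d++) {
            double sum = 0.0;
            for (double[] v : vectors) {
                sum += v[d];
            }
            double mean = sum / n;
            double squared = 0.0;
            for (double[] v : vectors) {
                squared += (v[d] - mean) * (v[d] - mean);
            }
            double std = Math.sqrt(squared / n);
            z[d] = (std == 0.0) ? 0.0 : (vectors[target][d] - mean) / std;
        }
        return z;
    }

    public static void main(String[] args) {
        double[][] fvs = {
            {1.0, 10.0}, // fv1
            {1.0, 20.0}, // fv2, the vector under inspection
            {1.0, 10.0}, // fv3
            {1.0, 10.0}, // fv4
            {1.0, 10.0}  // fv5
        };
        double[] z = zScores(fvs, 1);
        // Dimension 0 has no spread; dimension 1 has mean 12 and std 4, so z = 2.
        System.out.println(z[0] + " " + z[1]);
    }
}
```

The dimensions with the largest absolute z-values are the candidates for "awkward" fields. Whether to leave the inspected vector out of the mean and standard deviation is a judgment call; with only a handful of vectors it can change the numbers noticeably.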