I want to compare users based on their responses to 10 questions. My original idea was to map each response to an integer in [1, 5] and compare the resulting vectors with cosine similarity, but this won't always work. For example:

vec1 = [1,1,1,1,1,1,1,1,1,1]

vec2 = [5,5,5,5,5,5,5,5,5,5]

get_cos_sim(vec1, vec2) = 1

So even though the users responded completely differently, cosine similarity treats the two vectors as identical, because they point in the same direction.
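To illustrate why the two vectors score as identical, here is a minimal sketch of cosine similarity. The helper `cos_sim` is a hypothetical stand-in, since the question does not show how `get_cos_sim` is implemented; it assumes the usual definition (dot product divided by the product of the norms).

```python
import math

def cos_sim(a, b):
    """Cosine similarity; a hypothetical stand-in for the asker's get_cos_sim."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vec1 = [1] * 10
vec2 = [5] * 10

# vec2 is just vec1 scaled by 5, so the angle between them is zero
# and the result is 1 (up to floating-point rounding).
print(cos_sim(vec1, vec2))
```

Because the magnitudes cancel out, any vector that is a positive multiple of another gets the maximum score, regardless of how far apart the individual answers are.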

I would like to find similar users based on how close their responses to each question are. For a given question, if person A's response maps to 1, person B's to 2, and person C's to 4, then A should be more similar to B than to C on that question.

Jeremy Fisher

1 Answer

Here's the metric I would use:

Take the absolute value of the difference between each pair of corresponding answers and sum those values. The similarity is the reciprocal of that sum.
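As a rough sketch of that metric in Python: the function name `l1_similarity` is made up for illustration, and returning `math.inf` for identical vectors is an assumption about how to handle the division by zero.

```python
import math

def l1_similarity(a, b):
    """Reciprocal of the summed absolute differences (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("both users must answer the same number of questions")
    distance = sum(abs(x - y) for x, y in zip(a, b))
    # Assumption: identical answers have distance 0, treated as infinite similarity.
    return math.inf if distance == 0 else 1 / distance

print(l1_similarity([1] * 10, [5] * 10))  # distance is 40, so 1/40
print(l1_similarity([1] * 10, [2] * 10))  # distance is 10, so 1/10
```

Unlike cosine similarity, this score drops as the answers move apart, so the all-1s and all-5s users no longer look alike.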

OregonTrail
  • I thought about doing that, but here's the problem: take the two-element vectors `[0, 5]` and `[5, 0]`. The algorithm would return 10 for that pair, even though the users answered the questions in fundamentally different ways. I was thinking of Euclidean distance. – Jeremy Fisher Aug 18 '17 at 04:16
  • Right, take the inverse. Similarity for that vector would be 0.1 – OregonTrail Aug 18 '17 at 04:19
  • Similarity for [5, 0], [5, 0] would be infinity. Similarity for [3, 1], [5, 2] would be 0.333. – OregonTrail Aug 18 '17 at 04:21
  • So the closer the metric gets to infinity, the more similar the users are, and the closer it gets to 0, the less similar? – Jeremy Fisher Aug 18 '17 at 04:23
  • Well, the largest finite result comes from being "1 off". For example, [4, 5] and [5, 5] give 1/1, which is 1. So the range of the metric is really from just above 0 to 1, and you can clamp the divide-by-zero case at 1. – OregonTrail Aug 18 '17 at 04:27
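The clamping suggested in the last comment could look something like the sketch below. The name `clamped_similarity` is hypothetical, and the code assumes integer answers, so the smallest nonzero distance is 1.

```python
def clamped_similarity(a, b):
    """Reciprocal of the summed absolute differences, clamped to 1 (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("both users must answer the same number of questions")
    distance = sum(abs(x - y) for x, y in zip(a, b))
    # Clamp the divide-by-zero case (identical answers) at 1.
    return 1.0 if distance == 0 else 1.0 / distance

print(clamped_similarity([0, 5], [5, 0]))  # distance 10
print(clamped_similarity([4, 5], [5, 5]))  # distance 1
print(clamped_similarity([5, 0], [5, 0]))  # distance 0, clamped
```

One thing to be aware of with this clamp: identical vectors and vectors that are exactly "1 off" both score 1, so the metric cannot tell those two cases apart.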