
How can I calculate similarity between user and score?

For example, df:

    user    score   category_cluster
    i       4.5     category1
    j       5       category1
    k       9.5     category2

I want to have a result like:

The similarity between the scores of users i and j when they are in the same category_cluster; if they are not in the same cluster, the similarity should not be computed. How would you measure the similarity?

  • You have to choose the similarity based on your application. How did you derive 0.9 as the similarity for i and j? Are you looking for a similarity function of some sort? For multiple dimensions, there is a common "cosine similarity" that makes a good starting point. However, your example is 1-dimensional. – Prune Feb 18 '16 at 19:34
  • Yes, I'm looking for a similarity function like this, [0,1] range. Indeed, 1-dimensional, so I'm having difficulty :( – PineapplePizza Feb 18 '16 at 19:43

1 Answer


You will need to define a score function first. Among others, there are the Manhattan and Euclidean distances, which are probably the most widely used. For more information about distances, I suggest looking into scikit-learn, which has a wide variety of distances (metrics) implemented. Look here for a list (you can research later what each of them measures).
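As a minimal sketch of the distances mentioned above, assuming the three scores from the question are placed in a NumPy array (the variable names are illustrative only), scikit-learn's `pairwise_distances` returns the full distance matrix for a chosen metric:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Scores of users i, j, k from the question, as a column vector
# (scikit-learn expects one sample per row).
scores = np.array([4.5, 5.0, 9.5]).reshape(-1, 1)

# Pairwise distance matrices; entry [a, b] is the distance between rows a and b.
manhattan = pairwise_distances(scores, metric="manhattan")
euclidean = pairwise_distances(scores, metric="euclidean")

print(manhattan)
```

With a single feature the two metrics coincide, so the choice only starts to matter once more than one column is compared.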

Some of them are distance metrics (how different the elements are; the closer to 0, the more similar), while others measure similarity (like exponential kernels; the closer to 1, the more similar). It is easy to swap between distance and similarity metrics (the most basic conversion being distance = 1. - similarity, assuming both are in the [0,1] range).
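The conversion above can be sketched as follows. This is only one possible choice: it assumes the distances are rescaled to [0,1] by dividing by the largest observed distance, which is my assumption and not something the question specifies (a known score range would work as well).

```python
import numpy as np

# Scores of users i, j, k from the question.
scores = np.array([4.5, 5.0, 9.5])

# Pairwise absolute differences: a 1-D distance matrix.
distance = np.abs(scores[:, None] - scores[None, :])

# Rescale to [0, 1]. Assumption: normalize by the largest observed distance.
# This would divide by zero if all scores were identical.
normalized_distance = distance / distance.max()

# The basic swap between distance and similarity.
similarity = 1.0 - normalized_distance
```

Each item has similarity 1 with itself, and the two most distant scores get similarity 0.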

As for your example, similarity[i,j] = 0.9 doesn't make sense to me. What would be the similarity of i and k? Which formula did you use to get that 0.9? If you clarify it, I could provide you with a numpy-based implementation.

For direct similarity metrics, have a look here. You can use any of them if they suit your needs. The page explains what each of them measures.


An example usage of rbf_kernel:

    from sklearn.metrics.pairwise import rbf_kernel

    data = df['score'].values.reshape(-1, 1)  # column vector: one sample per row
    similarity = rbf_kernel(data, gamma=1.)  # Try different values of gamma

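gamma here acts like a threshold: rbf_kernel computes exp(-gamma * ||x - y||^2), so a larger gamma makes the similarity drop off faster as the scores move apart, while a smaller gamma treats more distant scores as similar.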
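To restrict the comparison to users in the same category_cluster, as the question asks, one option is to group the frame first and run rbf_kernel per group. This is a sketch under assumptions: the helper name `similarity_by_cluster` is hypothetical, and the DataFrame is rebuilt from the example in the question.

```python
import pandas as pd
from sklearn.metrics.pairwise import rbf_kernel

# Example data from the question.
df = pd.DataFrame({
    "user": ["i", "j", "k"],
    "score": [4.5, 5.0, 9.5],
    "category_cluster": ["category1", "category1", "category2"],
})

def similarity_by_cluster(frame, gamma=1.0):
    """Hypothetical helper: map each cluster to a user-by-user similarity table."""
    result = {}
    for cluster, group in frame.groupby("category_cluster"):
        # One sample per row, as rbf_kernel expects.
        scores = group["score"].to_numpy().reshape(-1, 1)
        users = group["user"].to_numpy()
        sim = rbf_kernel(scores, gamma=gamma)
        result[cluster] = pd.DataFrame(sim, index=users, columns=users)
    return result

per_cluster = similarity_by_cluster(df, gamma=1.0)
print(per_cluster["category1"])
```

Users in different clusters never appear in the same table, so no similarity is computed between them. A cluster with a single user yields a 1x1 table holding only the self-similarity.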

Imanol Luengo
  • By `similarity[i,j] = 0.9` I meant, item i (which has score 4.5) and item j (which has score 5) are very similar in [0,1] range. – PineapplePizza Feb 18 '16 at 20:28
  • Thank you for the links! – PineapplePizza Feb 18 '16 at 20:30
  • @Silvia07 Yep, but why not `0.8` or `0.97`, how did you get that `0.9`? I would suggest something like [rbf_kernel](http://scikit-learn.org/stable/modules/metrics.html#rbf-kernel), probably one of the most used similarity metrics. – Imanol Luengo Feb 18 '16 at 21:02
  • it was just an example :) I removed this example. thanks, i'll look into rbf_kernel then. – PineapplePizza Feb 18 '16 at 21:06
  • @Silvia07 Oh ok, thought u had some kind of measurement in mind, I'll edit the post with a example of how to use the `rbf_kernel`. – Imanol Luengo Feb 18 '16 at 21:34
  • Thank you! It's working! :) Is it possible to normalize this similarity? – PineapplePizza Feb 18 '16 at 22:10
  • @Silvia07 it is already normalized. Most similar items (each one compared to themselves) have a score of `1`. Different values of `gamma` will give you different *normalizations* (different scores for the rest of the values). – Imanol Luengo Feb 18 '16 at 22:36