Python: finding score similarity between users within a cluster

Question

How can I calculate the similarity between users based on their scores?

For example, df:

    user    score   category_cluster
    i       4.5     category1
    j       5       category1
    k       9.5     category2

I want the similarity between the scores of users i and j when they are in the same category_cluster; if they are not in the same cluster, the similarity should not be computed. How would you measure the similarity?

You have to choose the similarity based on your application. How did you derive 0.9 as the similarity for i and j? Are you looking for a similarity function of some sort? For multiple dimensions, there is a common "cosine similarity" that makes a good starting point. However, your example is 1-dimensional. — Prune, Feb 18 '16 at 19:34
Yes, I'm looking for a similarity function like this, [0,1] range. Indeed, 1-dimensional, so I'm having difficulty :( — PineapplePizza, Feb 18 '16 at 19:43

Imanol Luengo · Accepted Answer · 2016-02-18T21:37:56.577

You will need to define a score function first. Among others, you have the Manhattan and Euclidean distances, which are probably the most used ones. For more information about distances, I suggest looking into scikit-learn, which has a wide variety of distances (metrics) implemented. Look here for a list (you can research later what each of them measures).

Some of them are distance metrics (how different the elements are: the closer to 0, the more similar), while others measure similarity (like exponential kernels: the closer to 1, the more similar). It is easy to swap between distance and similarity metrics (the most basic conversion being distance = 1. - similarity, assuming both are in the [0,1] range).
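As a rough illustration of that conversion, here is a minimal sketch that turns a Manhattan distance on the 1-dimensional scores into a similarity in [0,1]. Dividing by the largest observed distance is an assumption made here to get into the [0,1] range, not something the question prescribes, and the variable names are only illustrative. It also ignores the clusters and compares every pair of scores.

```python
import numpy as np

# Scores of users i, j and k from the question.
scores = np.array([4.5, 5.0, 9.5])

# Pairwise Manhattan (absolute) distance between the 1-D scores.
distance = np.abs(scores[:, None] - scores[None, :])

# Rescale to [0, 1]. Using the largest observed distance is an assumed
# choice; dividing by a known score range would work just as well.
max_distance = distance.max()
if max_distance > 0:
    distance = distance / max_distance

# The basic conversion mentioned above: similarity = 1 - distance.
similarity = 1.0 - distance
print(similarity)
```

With this choice each item has similarity 1 with itself and the two most distant scores have similarity 0, so the values depend entirely on how the distance is normalized.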

As for your similarity example, similarity[i,j] = 0.9 doesn't make any sense to me. What would be the similarity of i and k? Which formula did you use to get that 0.9? If you clarify it, I could provide you with a numpy-based representation.

For direct similarity metrics, have a look here. You can use any of them if they suit your needs. The page explains what each of them measures.

An example usage of rbf_kernel:

    from sklearn.metrics.pairwise import rbf_kernel

    data = df['score'].to_numpy().reshape(-1, 1)  # column vector: one row per user
    similarity = rbf_kernel(data, gamma=1.)  # try different values of gamma

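gamma here acts like a threshold: different values of gamma make it easier or harder for two scores to count as similar. The kernel is exp(-gamma * ||x - y||^2), so a larger gamma makes the similarity drop off faster as the scores move apart.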
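To restrict the comparison to users in the same category_cluster, as the question asks, one possible approach is to group the frame first and apply rbf_kernel to each group separately. The sketch below assumes the column names from the question; the helper within_cluster_similarity is a hypothetical name introduced for illustration, not part of scikit-learn or pandas.

```python
import pandas as pd
from sklearn.metrics.pairwise import rbf_kernel

# The example frame from the question.
df = pd.DataFrame({
    "user": ["i", "j", "k"],
    "score": [4.5, 5.0, 9.5],
    "category_cluster": ["category1", "category1", "category2"],
})


def within_cluster_similarity(frame, gamma=1.0):
    """Hypothetical helper: map each cluster to a user-by-user RBF similarity table."""
    result = {}
    for cluster, group in frame.groupby("category_cluster"):
        # rbf_kernel expects a 2-D array, so make the scores a column vector.
        scores = group["score"].to_numpy().reshape(-1, 1)
        users = group["user"].to_numpy()
        sim = rbf_kernel(scores, gamma=gamma)
        result[cluster] = pd.DataFrame(sim, index=users, columns=users)
    return result


# Users in different clusters never appear in the same table.
for gamma in (0.1, 1.0, 10.0):
    tables = within_cluster_similarity(df, gamma=gamma)
    print(gamma, tables["category1"].loc["i", "j"])
```

Printing the i, j entry for several values of gamma shows how quickly the similarity falls as gamma grows, which may help when picking a value for real data.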

edited Feb 18 '16 at 21:37

answered Feb 18 '16 at 20:15

Imanol Luengo


By `similarity[i,j] = 0.9` I meant, item i (which has score 4.5) and item j (which has score 5) are very similar in [0,1] range. – PineapplePizza Feb 18 '16 at 20:28
Thank you for the links! – PineapplePizza Feb 18 '16 at 20:30
@Silvia07 Yep, but why not `0.8` or `0.97`, how did you get that `0.9`? I would suggest something like [rbf_kernel](http://scikit-learn.org/stable/modules/metrics.html#rbf-kernel), probably one of the most used similarity metrics. – Imanol Luengo Feb 18 '16 at 21:02
it was just an example :) I removed this example. thanks, i'll look into rbf_kernel then. – PineapplePizza Feb 18 '16 at 21:06
@Silvia07 Oh ok, thought u had some kind of measurement in mind, I'll edit the post with a example of how to use the `rbf_kernel`. – Imanol Luengo Feb 18 '16 at 21:34
Thank you! It's working! :) Is it possible to normalize this similarity? – PineapplePizza Feb 18 '16 at 22:10
@Silvia07 it is already normalized. Most similar items (each one compared to themselves) have a score of `1`. Different values of `gamma` will give you different *normalizations* (different scores for the rest of the values). – Imanol Luengo Feb 18 '16 at 22:36
