
My objective is to calculate the degree of similarity between two users based on their attributes. For instance, take players described by three attributes: age, salary, and points.

I also want to weight each attribute by its importance. In my case, age is more important than salary and points. For this example, let's assume we calculate similarity using the Euclidean distance.

Given user 1 who is age 20, salary 50, points scored 100

Given user 2 who is age 24, salary 60, points scored 85

Given user 3 who is age 19, salary 62, points scored 80

To compute the similarity between user 1 and user 2 I could do

sqrt( (20-24)^2 + (50-60)^2 + (100-85)^2 )

Now we also want to add the weights. With Euclidean distance, the lower the number, the closer two objects are in terms of similarity. As mentioned earlier, age is the most important attribute, so we assign the weights as follows:

sqrt( 0.60*(20-24)^2 + 0.20*(50-60)^2 + 0.20*(100-85)^2 )
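As a minimal sketch of the weighted formula above, here it is in Python applied to the three example users. The function name `weighted_euclidean` is only illustrative and does not come from any library.

```python
import math

def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance: sqrt(sum(w_i * (a_i - b_i)^2))."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("a, b and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Attributes in the order (age, salary, points), taken from the example users
user1 = (20, 50, 100)
user2 = (24, 60, 85)
user3 = (19, 62, 80)
weights = (0.60, 0.20, 0.20)  # age weighted highest

d12 = weighted_euclidean(user1, user2, weights)
d13 = weighted_euclidean(user1, user3, weights)
print(d12, d13)
```

With these raw values, user 2 comes out closer to user 1 (about 8.64) than user 3 does (about 10.46), even though user 3 is closer in age, because the salary and points differences are numerically much larger.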

Is my approach correct? Also, should I be considering other measures, such as cosine similarity, to calculate similarity?

user1010101
  • Correct with respect to what? This approach is at least reasonable. What other similarity measures you want to use depends solely on your application. And you will probably need to test a few to find which one works best. – Nico Schertler Nov 02 '16 at 15:07
  • @NicoSchertler I was not sure if I was adding the weights correctly, and I was also wondering if there are other algorithms that would calculate similarity between two users more accurately. For instance, age is the most important factor for my application – user1010101 Nov 02 '16 at 15:09
  • The weighting looks good to me. Of course, you should be aware of the ranges of the attributes. If they are different, then you may want to introduce some normalization. There are a whole lot of other similarity measures. [Wolfram](https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html) lists some of them (see Numerical Data). – Nico Schertler Nov 02 '16 at 15:12
  • @NicoSchertler Good point, I will definitely make sure the attributes are normalized. Thank you, that is a good resource. If you post your comment as an answer, I can accept it. – user1010101 Nov 02 '16 at 15:16
  • It's not really an answer, so I'll leave it as a comment. If you finish your research, you may post an answer yourself, describing the method that worked best for your case. – Nico Schertler Nov 02 '16 at 15:19

1 Answer


I am currently working on a project that involves calculating measurements between different entities, so I am familiar with your problem.

The good thing in your case is that you don't have features of mixed types (e.g. text or categorical). Age, salary, and points are all numbers, and as already mentioned in the comments, the first thing you should do is normalize them. This is a must: if you skip it, the feature with the largest numeric range can dominate the distance.

You also have to check your data and clean it if necessary. For example, a bad value such as an age of 200 will distort your normalization, and the majority of the scaled age values will end up in the lower part of the range (close to zero).
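To illustrate the outlier effect, here is a minimal sketch that assumes min-max scaling to [0, 1], which is one common normalization (the choice of method is an assumption, not something fixed by the question). The helper name `min_max_scale` is made up for this example.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid division by zero when every value is identical
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 24, 19]                         # ages of the three example users
clean = min_max_scale(ages)                 # spread across the full [0, 1] range
with_outlier = min_max_scale(ages + [200])  # one bad record with age 200

print(clean)
print(with_outlier)
```

With the bad record included, the three genuine ages are all squeezed below 0.03, so age differences between real users almost vanish from the distance.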

Your weighting and the weighted Euclidean calculation are right. The weights sum to 1, as in your example (0.6 + 0.2 + 0.2 = 1).

Which distance metric to use is a good question. There are a bunch of them; see e.g. https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

Based on my experience I would choose Euclidean, although you should try a few and check how each works on your data.
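As a rough way to try a few metrics, the sketch below compares Euclidean, Manhattan, and cosine distance on the three example users from the question. It uses the raw, un-normalized values purely for illustration, and the functions are hand-written helpers, not SciPy calls.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction only, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# (age, salary, points) for the example users
users = {"user1": (20, 50, 100), "user2": (24, 60, 85), "user3": (19, 62, 80)}

for other in ("user2", "user3"):
    a, b = users["user1"], users[other]
    print(other, euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
```

One thing to keep in mind when comparing them: cosine distance treats a user and a copy with every attribute doubled as identical, so it fits cases where the proportions between attributes matter more than their absolute values.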

milos.ai