How to compute a similarity between two vectors with heterogeneous attributes

Question

I have an optimization problem with a set of providers P selling objects Op of different types, each with a performance vector Pv = [p1, p2, p3, ..., pn], and a set of client requests R asking for objects Or, each with an expected performance vector Er = [e1, e2, ..., en].

I would like to find which of the providers' objects are close enough to the ones requested by clients, given the performance vectors. I have looked at some measures, such as squared Euclidean distance, but I am not sure how to use it since the components of the performance vectors have different units, i.e. p1 is measured in seconds, p2 is measured in dollars, and so on.

Could anyone shed some light and suggest a methodology?

Scale all features to be between 0 and 1, with a similar standard deviation, and yes, Euclidean distance is a good first start. — Matthieu Brucher, Nov 12 '18 at 14:14
The features are different and have different values. How can I scale them to a similar standard deviation when they are heterogeneous? — user2567806, Nov 12 '18 at 14:17
You have a distribution for each of the features; scale them based on these distributions. — Matthieu Brucher, Nov 12 '18 at 14:18
You have a distribution for p1, with a mean and a standard deviation, transform the entries for p1 by removing the mean and dividing by the standard deviation. — Matthieu Brucher, Nov 12 '18 at 14:23
So you mean I need to take all the p1 values across all the objects, I guess, and do the same for all the other features? — user2567806, Nov 12 '18 at 14:25
Yes, exactly. Then all features will be comparable. I'll write an answer with these. — Matthieu Brucher, Nov 12 '18 at 14:31

score 1 · Accepted Answer · answered Nov 12 '18 at 14:35

The first idea you should try is to scale each of your features independently before comparing them.

For instance, get all your p1 samples, compute mean and standard deviation, then transform your samples to (s - mean)/std. Do this for each of your features, except for those that are already binary (0/1).
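As a minimal sketch of this per-feature scaling, assuming NumPy and a small made-up data set (the `standardize` helper, the `binary_columns` parameter, and the sample values are hypothetical, not from the question):

```python
import numpy as np

def standardize(samples, binary_columns=()):
    """Z-score each column of `samples` (rows = objects, columns = features)."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    # Guard against division by zero for constant features.
    std[std == 0] = 1.0
    # Leave binary (0/1) columns untouched.
    for col in binary_columns:
        mean[col] = 0.0
        std[col] = 1.0
    return (samples - mean) / std, mean, std

# Hypothetical data: columns are [seconds, dollars, binary flag (0/1)].
providers = np.array([
    [0.5, 100.0, 1.0],
    [1.5, 300.0, 0.0],
    [2.5, 200.0, 1.0],
])
scaled, mean, std = standardize(providers, binary_columns=[2])
```

The returned `mean` and `std` can be reused to transform the request vectors with the same statistics, so that providers and requests end up on the same scale.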

Then you can use Euclidean distance as a first attempt at judging whether two points are far apart or not.
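A rough illustration of that distance step, assuming the vectors have already been scaled (the array names and values below are made up for the example):

```python
import numpy as np

# Hypothetical, already-scaled vectors (rows = objects, columns = features).
providers_scaled = np.array([
    [-1.0,  0.5],
    [ 0.0, -1.0],
    [ 1.0,  0.5],
])
requests_scaled = np.array([
    [0.0, -1.0],
    [1.0,  1.0],
])

# Pairwise Euclidean distances: entry [i, j] is the distance between
# request i and provider object j.
diff = requests_scaled[:, None, :] - providers_scaled[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Index of the closest provider object for each request.
closest = distances.argmin(axis=1)
```

What counts as "close enough" is a threshold on `distances` that depends on the application; the sketch only picks the nearest provider object per request.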

Similarity measures are something different, yet related. You can use something like e^(-distance(x, y)) to get a similarity between 0 and 1, and there are other measures you could try as well. You should use these on the scaled data, not the original data.
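The exponential mapping could look something like this, assuming Euclidean distance as the underlying distance (the `similarity` function name is hypothetical):

```python
import numpy as np

def similarity(x, y):
    """Map the Euclidean distance d between x and y to exp(-d), a value in (0, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.linalg.norm(x - y)
    return float(np.exp(-d))

# Identical vectors have distance 0, so the similarity is exactly 1.
same = similarity([0.0, 1.0], [0.0, 1.0])
# These two vectors are at distance 5, so the similarity is exp(-5), close to 0.
far = similarity([0.0, 0.0], [3.0, 4.0])
```

Identical vectors get a similarity of 1, and the value decays toward 0 as the scaled vectors move apart.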
