
My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:

For example, if symmetric vs. asymmetric is represented as 0 or 1 and some less important ratio ranges from 0 to 100, then changing from symmetric to asymmetric has a tiny distance impact compared to changing the ratio by 25.

I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?
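For concreteness, here is a minimal Python sketch of the imbalance, using made-up feature vectors of the form [symmetry flag, ratio]:

```python
import math

# Hypothetical feature vectors: [symmetry flag (0/1), ratio in 0..100].
a = [0, 50.0]
b = [1, 50.0]   # symmetric -> asymmetric, same ratio
c = [0, 75.0]   # same flag, ratio changed by 25

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

print(euclidean(a, b))  # 1.0  -- flipping the flag barely registers
print(euclidean(a, c))  # 25.0 -- the ratio change dominates
```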

skaffman
John Hall

3 Answers


You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.

It simply scales every feature (continuous or discrete) by its standard deviation. This is more robust than, say, scaling by the range (max-min) as suggested by another poster.
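A minimal Python sketch of this idea, with hypothetical data (the standard deviations here are estimated from a made-up sample):

```python
import numpy as np

# Hypothetical samples: each row is [symmetry flag (0/1), ratio in 0..100].
X = np.array([
    [0, 12.0],
    [1, 35.0],
    [0, 60.0],
    [1, 88.0],
])

# Per-feature standard deviation, estimated from the data.
sd = X.std(axis=0)

def normalized_euclidean(u, v, sd):
    # Divide each feature difference by that feature's standard deviation.
    return float(np.sqrt((((u - v) / sd) ** 2).sum()))

a = np.array([0, 50.0])
b = np.array([1, 50.0])   # flag flipped
c = np.array([0, 75.0])   # ratio shifted by 25
print(normalized_euclidean(a, b, sd))
print(normalized_euclidean(a, c, sd))
```

After scaling, flipping the binary flag contributes on the same footing (in standard deviations) as a change in the ratio, instead of being drowned out.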

NPE

If I understand your question correctly, normalizing (a.k.a. 'rescaling') each dimension or column in the data set is the conventional technique for dealing with over-weighted dimensions, e.g.,

ev_scaled = (ev_raw - ev_min) / (ev_max - ev_min)

In R, for instance, you can write this function:

ev_scaled = function(x) {
    (x - min(x)) / (max(x) - min(x))
}

which works like this:

# generate some data: 
# v1, v2 are two expectation variables in the same dataset 
# but have very different 'scale':
> v1 = seq(100, 550, 50)
> v1
  [1] 100 150 200 250 300 350 400 450 500 550
> v2 = sort(sample(seq(.1, 20, .1), 10))
> v2
  [1]  0.2  3.5  5.1  5.6  8.0  8.3  9.9 11.3 15.5 19.4
> mean(v1)
  [1] 325
> mean(v2)
  [1] 8.68

# now normalize v1 & v2 using the function above:
> v1_scaled = ev_scaled(v1)
> v1_scaled
  [1] 0.000 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1.000
> v2_scaled = ev_scaled(v2)
> v2_scaled
  [1] 0.000 0.172 0.255 0.281 0.406 0.422 0.505 0.578 0.797 1.000
> mean(v1_scaled)
  [1] 0.5
> mean(v2_scaled)
  [1] 0.442
> range(v1_scaled)
  [1] 0 1
> range(v2_scaled)
  [1] 0 1
doug

You can also try the Mahalanobis distance instead of the Euclidean; it scales each feature by the data's covariance, so it handles both differing feature scales and correlations between features.
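A minimal Python sketch with NumPy, using hypothetical data (the covariance matrix here is estimated from a made-up sample; SciPy's `scipy.spatial.distance.mahalanobis` does the same computation):

```python
import numpy as np

# Hypothetical samples: each row is [symmetry flag (0/1), ratio in 0..100].
X = np.array([
    [0, 12.0],
    [1, 35.0],
    [0, 60.0],
    [1, 88.0],
    [0, 45.0],
    [1, 20.0],
])

# Inverse of the sample covariance matrix (features as columns).
VI = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(u, v, VI):
    # sqrt((u - v)^T VI (u - v))
    d = u - v
    return float(np.sqrt(d @ VI @ d))

a = np.array([0, 50.0])
b = np.array([1, 50.0])  # flag flipped, same ratio
c = np.array([0, 75.0])  # same flag, ratio changed by 25
print(mahalanobis(a, b, VI), mahalanobis(a, c, VI))
```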

Dima