
My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:

For example, if symmetric vs. asymmetric is represented as 0 or 1 and some less important ratio ranges from 0 to 100, then changing from symmetric to asymmetric has a tiny distance impact compared to changing the ratio by 25.

I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?
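For concreteness, here is a minimal Python sketch of the imbalance, using made-up feature vectors of the form [symmetry flag, ratio]:

```python
import math

# Hypothetical feature vectors: [symmetry flag (0/1), ratio in 0..100].
a = [0, 50.0]
b = [1, 50.0]   # symmetric -> asymmetric, same ratio
c = [0, 75.0]   # same flag, ratio changed by 25

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

print(euclidean(a, b))  # 1.0  -- flipping the flag barely registers
print(euclidean(a, c))  # 25.0 -- the ratio change dominates
```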

skaffman
John Hall

3 Answers


You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.

It simply scales every feature (continuous or discrete) by its standard deviation. This is more robust than, say, scaling by the range (max-min) as suggested by another poster.
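A minimal Python sketch of this idea, with hypothetical data (the standard deviations here are estimated from a made-up sample):

```python
import numpy as np

# Hypothetical samples: each row is [symmetry flag (0/1), ratio in 0..100].
X = np.array([
    [0, 12.0],
    [1, 35.0],
    [0, 60.0],
    [1, 88.0],
])

# Per-feature standard deviation, estimated from the data.
sd = X.std(axis=0)

def normalized_euclidean(u, v, sd):
    # Divide each feature difference by that feature's standard deviation.
    return float(np.sqrt((((u - v) / sd) ** 2).sum()))

a = np.array([0, 50.0])
b = np.array([1, 50.0])   # flag flipped
c = np.array([0, 75.0])   # ratio shifted by 25
print(normalized_euclidean(a, b, sd))
print(normalized_euclidean(a, c, sd))
```

After scaling, flipping the binary flag contributes on the same footing (in standard deviations) as a change in the ratio, instead of being drowned out.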

NPE

If I understand your question correctly, normalizing (a.k.a. 'rescaling') each dimension or column in the data set is the conventional technique for dealing with over-weighted dimensions, e.g.,

ev_scaled = (ev_raw - ev_min) / (ev_max - ev_min)

In R, for instance, you can write this function:

ev_scaled = function(x) {
    (x - min(x)) / (max(x) - min(x))
}

which works like this:

# generate some data: 
# v1, v2 are two expectation variables in the same dataset 
# but have very different 'scale':
> v1 = seq(100, 550, 50)
> v1
  [1] 100 150 200 250 300 350 400 450 500 550
> v2 = sort(sample(seq(.1, 20, .1), 10))
> v2
  [1]  0.2  3.5  5.1  5.6  8.0  8.3  9.9 11.3 15.5 19.4
> mean(v1)
  [1] 325
> mean(v2)
  [1] 8.68

# now normalize v1 & v2 using the function above:
> v1_scaled = ev_scaled(v1)
> v1_scaled
  [1] 0.000 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1.000
> v2_scaled = ev_scaled(v2)
> v2_scaled
  [1] 0.000 0.172 0.255 0.281 0.406 0.422 0.505 0.578 0.797 1.000
> mean(v1_scaled)
  [1] 0.5
> mean(v2_scaled)
  [1] 0.442
> range(v1_scaled)
  [1] 0 1
> range(v2_scaled)
  [1] 0 1
doug

You can also try the Mahalanobis distance instead of the Euclidean; it scales each feature by the data's covariance, so it handles both differing feature scales and correlations between features.
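A minimal Python sketch with NumPy, using hypothetical data (the covariance matrix here is estimated from a made-up sample; SciPy's `scipy.spatial.distance.mahalanobis` does the same computation):

```python
import numpy as np

# Hypothetical samples: each row is [symmetry flag (0/1), ratio in 0..100].
X = np.array([
    [0, 12.0],
    [1, 35.0],
    [0, 60.0],
    [1, 88.0],
    [0, 45.0],
    [1, 20.0],
])

# Inverse of the sample covariance matrix (features as columns).
VI = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(u, v, VI):
    # sqrt((u - v)^T VI (u - v))
    d = u - v
    return float(np.sqrt(d @ VI @ d))

a = np.array([0, 50.0])
b = np.array([1, 50.0])  # flag flipped, same ratio
c = np.array([0, 75.0])  # same flag, ratio changed by 25
print(mahalanobis(a, b, VI), mahalanobis(a, c, VI))
```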

Dima