Let's say we have a list of people and would like to find the people most similar to person X.
The feature vector has 3 items, [weight, height, age], and there are 3 people in our list. Note that we don't know the height of person C.
A: [70kg, 170cm, 60y]
B: [60kg, 169cm, 50y]
C: [60kg, ?, 50y]
What would be the best way to find people closest to person A?
My guess
Let's calculate the average value for height and use it in place of the unknown value.
So, let's say we calculated that 170cm is the average height, and we redefine person C as [60kg, ~170cm, 50y].
Now we can rank people by closeness to A; the order will be A, C, B.
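To make the guess concrete, here is a minimal sketch of that ranking. It assumes plain Euclidean distance on the raw, unscaled values and takes 170cm as the given average height; the names `impute`, `euclidean` and `ASSUMED_MEAN_HEIGHT` are only illustrative.

```python
import math

# People from the example as [weight_kg, height_cm, age_y]; None marks an unknown value.
people = {
    "A": [70, 170, 60],
    "B": [60, 169, 50],
    "C": [60, None, 50],
}

# Assumption: the average height is taken as given (170cm), not computed from this list.
ASSUMED_MEAN_HEIGHT = 170
fill_values = [None, ASSUMED_MEAN_HEIGHT, None]  # only height is missing in this example


def impute(vector, fills):
    """Replace each None in the vector with the fill value for that position."""
    return [fill if value is None else value for value, fill in zip(vector, fills)]


def euclidean(u, v):
    """Plain Euclidean distance between two complete vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


filled = {name: impute(vec, fill_values) for name, vec in people.items()}
ranking = sorted(filled, key=lambda name: euclidean(filled["A"], filled[name]))
print(ranking)  # ['A', 'C', 'B']
```

C lands ahead of B only because the guessed 170cm happens to match A's height exactly: the squared distance is 200 for C versus 201 for B.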
Problem
Now, the problem is that we put C, with a guessed ~170cm, before B, with a known 169cm.
It feels wrong. We humans are smarter than machines and know that there's little chance that C is exactly 170cm. So it would be better to put B, with 169cm, before C.
But how can we calculate that penalty (preferably with a simple empirical algorithm)? Should we somehow penalise vectors with unknown values? And by how much (maybe by the average difference in height between every pair of people in the set)?
And what would that penalisation look like in the general case, where the feature vector has dimension N, with K known items and U unknown ones (K + U = N)?
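For illustration, here is one way the "average difference between every two people" idea could be sketched for N dimensions. This is only an assumption about how such a penalty might look, not an established method: every missing dimension contributes that dimension's mean absolute pairwise gap (computed from the known values) in place of a real difference. The names `mean_pairwise_gap` and `penalised_distance` are hypothetical.

```python
import math
from itertools import combinations

# People from the example as [weight_kg, height_cm, age_y]; None marks an unknown value.
people = {
    "A": [70, 170, 60],
    "B": [60, 169, 50],
    "C": [60, None, 50],
}


def mean_pairwise_gap(values):
    """Average absolute difference over all pairs of known values (0.0 if fewer than two)."""
    known = [v for v in values if v is not None]
    pairs = list(combinations(known, 2))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def penalised_distance(u, v, gaps):
    """Euclidean distance where a dimension with a missing value contributes
    that dimension's typical gap instead of a real difference."""
    total = 0.0
    for a, b, gap in zip(u, v, gaps):
        diff = gap if a is None or b is None else a - b
        total += diff ** 2
    return math.sqrt(total)


n_features = 3
# One typical gap per dimension, computed from whatever values are known.
gaps = [mean_pairwise_gap([vec[i] for vec in people.values()]) for i in range(n_features)]

distances = {name: penalised_distance(people["A"], vec, gaps) for name, vec in people.items()}
ranking = sorted(distances, key=distances.get)
print(ranking)
```

With only two known heights (170cm and 169cm) the typical gap is 1cm, so C ties with B here instead of outranking it. A set with more varied heights would give a larger gap and push C further down. Whether this is the right size for the penalty is exactly the open question.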