I was going through the KNNImputer documentation, and it says:
Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither are missing are close.
Now, playing around with a toy dataset, e.g.
>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
>>> X
array([[ 1.,  2., nan],
       [ 3.,  4.,  3.],
       [nan,  6.,  5.],
       [ 8.,  8.,  7.]])
And we make a KNNImputer as follows:
imputer = KNNImputer(n_neighbors=2)
The question is: how does it fill the nans when there are nans in 2 of the columns? For example, to fill the nan in the 3rd column of the 1st row, how does it decide which rows are the closest, given that one of the rows has a nan in the first column as well? When I run imputer.fit_transform(X), it gives me
array([[1. , 2. , 4. ],
[3. , 4. , 3. ],
[5.5, 6. , 5. ],
[8. , 8. , 7. ]])
which means that, to fill the nan in row 1, the nearest neighbors were the second and third rows. How did it calculate the Euclidean distance between the first and the third row?
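To make the question concrete, here is a minimal pure-Python sketch of one reading that reproduces the output above. It assumes the imputer uses the NaN-aware Euclidean distance that scikit-learn documents for nan_euclidean_distances: squared differences are summed only over coordinates present in both rows, then scaled by (total coordinates / present coordinates) before taking the square root. The helper names nan_euclidean and knn_impute are hypothetical and are not the library's functions; the sketch also assumes uniform weights and picks donors only among rows that have the missing feature.

```python
import math

nan = float("nan")

def nan_euclidean(a, b):
    """Hypothetical helper: distance over coordinates present in both rows,
    rescaled by (total coordinates / present coordinates)."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return nan  # no shared coordinates, so the distance is undefined
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(len(a) / len(pairs) * squared)

def knn_impute(X, n_neighbors=2):
    """Hypothetical sketch: replace each nan with the unweighted mean of that
    feature over the n_neighbors closest rows where the feature is present."""
    out = [list(row) for row in X]
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Donors are rows that actually have feature j.
            donors = [(nan_euclidean(row, other), other[j])
                      for other in X if not math.isnan(other[j])]
            # Drop donors whose distance to this row is undefined.
            donors = [d for d in donors if not math.isnan(d[0])]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:n_neighbors]
            if nearest:
                out[i][j] = sum(v for _, v in nearest) / len(nearest)
    return out

X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

# Distances from the first row to the other three rows.
print([round(nan_euclidean(X[0], X[k]), 2) for k in (1, 2, 3)])
# -> [3.46, 6.93, 11.29]

print(knn_impute(X))
# -> [[1, 2, 4.0], [3, 4, 3], [5.5, 6, 5], [8, 8, 7]]
```

Under this reading, the first and third rows share only the second column, so their distance is sqrt(3/1 * (2 - 6)^2) = sqrt(48), roughly 6.93. That is still smaller than the distance to the fourth row, which would explain why the second and third rows were picked.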