Understanding sklearn's KNNImputer

Question

I was going through its documentation and it says

Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither are missing are close.

Now, playing around with a toy dataset, i.e.

>>> from numpy import nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> X

   [[ 1.,  2., nan],
    [ 3.,  4.,  3.],
    [nan,  6.,  5.],
    [ 8.,  8.,  7.]]

And we make a KNNImputer as follows:

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)

The question is, how does it fill the nans when 2 of the columns contain nans? For example, to fill the nan in the 3rd column of the 1st row, how will it decide which rows are the closest, since one of the other rows has a nan in the first column as well? When I do imputer.fit_transform(X) it gives me

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

which means that, for filling the nan in row 1, the nearest neighbors were the second and the third row. How did it calculate the Euclidean distance between the first and the third row?

yatu · Accepted Answer · 2020-05-12T14:23:01.563

How does it fill the NaNs using rows that also have NaNs?

This doesn't seem to be mentioned in the docs. But digging a bit into the source code, it appears that for each column being imputed, all donors at a smaller distance are considered, even if they themselves have missing values. This is handled by setting the NaN entries of a weight matrix to 0; the weight matrix is derived from the distances, see _get_weights.

The relevant code is in _calc_impute, where, after finding a distance matrix for all potential donors and then the above-mentioned matrix of weights, the NaN weights are zeroed out:

# fill nans with zeros
if weight_matrix is not None:
    weight_matrix[np.isnan(weight_matrix)] = 0.0

All potential donors are considered as long as they have at least one non-NaN distance to the receiver, as the docstring states:

dist_pot_donors : ndarray of shape (n_receivers, n_potential_donors)
    Distance matrix between the receivers and potential donors from
    training set. There must be at least one non-nan distance between
    a receiver and a potential donor.
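To make the "non-nan distance" condition concrete, here is a minimal sketch using sklearn's public nan_euclidean_distances function. The receiver and donor rows are made up for illustration and are not taken from the source code. A donor that shares at least one observed feature with the receiver gets a finite distance; a donor that shares none gets NaN.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Hypothetical rows, chosen only to illustrate the two cases.
receiver = np.array([[np.nan, 2.0, 3.0]])
donors = np.array([
    [1.0, np.nan, 5.0],     # shares the third feature with the receiver
    [4.0, np.nan, np.nan],  # shares no observed feature with the receiver
])

dist = nan_euclidean_distances(receiver, donors)

# First donor: one shared feature out of three, so the squared difference
# (3 - 5)**2 = 4 is scaled by 3 / 1 before taking the square root.
# Second donor: no shared feature, so the distance is NaN.
print(dist)
```

The scaling by (total features / shared features) is what the nan_euclidean_distances documentation describes; it keeps distances comparable between pairs that share different numbers of features.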

We can check this with a toy example. In the following matrix, when imputing the missing value in [nan, 7., 4., 5.], the last row (which itself contains two NaNs) is chosen (note that I've set n_neighbors=1). This is because the distance to the last row is 0: the NaN coordinates contribute nothing to the distance, and the only feature both rows have (the last column) is equal. Rows 2 and 3 each differ from the first row by a small amount, so the last row, which is seen as identical, is the nearest neighbor:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[np.nan, 7, 4, 5], [2, 8, 4, 5], [3, 7, 4, 6], [1, np.nan, np.nan, 5]])

X
array([[nan,  7.,  4.,  5.],
       [ 2.,  8.,  4.,  5.],
       [ 3.,  7.,  4.,  6.],
       [ 1., nan, nan,  5.]])

imputer = KNNImputer(n_neighbors=1)

imputer.fit_transform(X)
array([[1., 7., 4., 5.],
       [2., 8., 4., 5.],
       [3., 7., 4., 6.],
       [1., 7., 4., 5.]])
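As a rough check of that reasoning, the distances from the first row to the other three can be computed directly with nan_euclidean_distances. This is only a sketch of the distance step, not the imputer's internal code path; the variable names are mine.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[np.nan, 7, 4, 5],
              [2, 8, 4, 5],
              [3, 7, 4, 6],
              [1, np.nan, np.nan, 5]])

# Distances from the first row to each of the remaining rows.
dist_from_first = nan_euclidean_distances(X[:1], X[1:])[0]

# Index (in X) of the closest row; +1 because X[1:] drops the first row.
nearest = int(np.argmin(dist_from_first)) + 1

print(dist_from_first)  # rows 2 and 3 are at sqrt(4/3), the last row at 0
print(nearest)          # 3, i.e. the last row
```

Rows 2 and 3 each differ from the first row by 1 in a single shared feature (three shared features out of four, hence sqrt(4/3 * 1)), while the last row matches on its only shared feature and ends up at distance 0.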

Can you explain "by just having a minimal difference with rows 2 and 3, the last row is chosen since it is seen as being equal"? — arghhjayy, May 13 '20 at 06:25
If you look, the last row and the first only share the last column, since the rest of the values in the last row are NaN. And still it is the row used to fill in the first row. That is because there is 0 difference: NaNs are seen as 0 difference. Rows two and three do have some error, minimal, but they are different. So the one with the smallest error is chosen @arghhjayy — yatu, May 13 '20 at 06:28
Understood. Now, say imputing row 1 column 1 nan was the first step. For the second step, we want to replace the nan in row 4 column 2. Then, for step 2, what will be the value at row 1 column 1, will it be a nan or the imputed value in the first step? — arghhjayy, May 13 '20 at 07:32
NaNs are filled column by column, as you can see in [`process_chunk`](https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/impute/_knn.py#L235). So since the nan in the first row is at the left, it will already have been imputed, so the new value should be considered if I'm not missing something @arghhjayy — yatu, May 13 '20 at 07:36

score 1 · Answer 2 · answered Aug 22 '22 at 06:55

The `nan_euclidean_distance`

The KNNImputer in sklearn uses nan_euclidean_distances as its distance metric. You can find the docs here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html

In a nutshell, it uses only the features available in both observations to compute a pseudo-Euclidean distance. This way, even observations with missing values can have their distance computed to other observations.
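To tie this back to the toy dataset in the question, here is a hand-rolled sketch that reproduces the two imputed values (4 and 5.5). The helper impute_one is hypothetical and written only for illustration; it assumes uniform weights and is not how sklearn structures its code internally. For each missing entry it takes the rows that have a value in that column, ranks them by nan_euclidean_distances, and averages the nearest n_neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]], dtype=float)


def impute_one(X, row, col, n_neighbors=2):
    """Illustrative helper: mean of `col` over the nearest rows that have it."""
    # Potential donors: rows where the target column is observed.
    donors = np.flatnonzero(~np.isnan(X[:, col]))
    # NaN-aware distances from the receiver row to every donor.
    dist = nan_euclidean_distances(X[[row]], X[donors])[0]
    # Keep the n_neighbors closest donors and average their values.
    nearest = donors[np.argsort(dist)[:n_neighbors]]
    return X[nearest, col].mean()


manual_row0 = impute_one(X, row=0, col=2)  # nan in row 1, column 3
manual_row2 = impute_one(X, row=2, col=0)  # nan in row 3, column 1

sklearn_result = KNNImputer(n_neighbors=2).fit_transform(X)

print(manual_row0, sklearn_result[0, 2])
print(manual_row2, sklearn_result[2, 0])
```

For the first row, the distance to the third row uses only the second column (the one feature both rows have), scaled by 3/1, so it comes out larger than the distance to the second row but smaller than the distance to the fourth. That is why rows two and three are the donors and the imputed value is the mean of 3 and 5.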
