I ran into this problem while implementing KNN imputation for missing data from scratch. I created a dummy dataset and found the nearest neighbors for the rows that contain missing values. Here is my dataset:
         A    B     C     D      E
    0  NaN  2.0   4.0  10.0  100.0
    1  NaN  3.0   9.0  12.0    NaN
    2  5.0  2.0  20.0  50.0   75.0
    3  3.0  5.0   7.0   NaN  150.0
    4  2.0  9.0   7.0  30.0   90.0
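For reference, here is a minimal sketch of how row-to-row distances could be computed on this dataset when some coordinates are missing. It assumes the NaN-aware Euclidean distance (the definition behind sklearn's `nan_euclidean_distances`): squared differences are summed over the coordinates present in both rows, then rescaled by total coordinates / present coordinates. The names `rows` and `nan_euclidean` are illustrative, and a different metric may rank the neighbors differently.

```python
import math

nan = math.nan

# The dummy dataset, columns A..E, one list per row.
rows = [
    [nan, 2.0, 4.0, 10.0, 100.0],
    [nan, 3.0, 9.0, 12.0, nan],
    [5.0, 2.0, 20.0, 50.0, 75.0],
    [3.0, 5.0, 7.0, nan, 150.0],
    [2.0, 9.0, 7.0, 30.0, 90.0],
]


def nan_euclidean(u, v):
    """Euclidean distance over coordinates present in both rows, rescaled."""
    pairs = [(a, b) for a, b in zip(u, v)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return math.nan  # no shared coordinates, distance is undefined
    scale = len(u) / len(pairs)  # compensates for the skipped coordinates
    return math.sqrt(scale * sum((a - b) ** 2 for a, b in pairs))


# Distances from row 0 to every other row, nearest first.
dist_from_row0 = {j: nan_euclidean(rows[0], rows[j]) for j in range(1, len(rows))}
for j, d in sorted(dist_from_row0.items(), key=lambda kv: kv[1]):
    print(f"row 0 -> row {j}: {d:.2f}")
```

The rescaling keeps distances comparable between row pairs that share different numbers of observed columns.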
For row 0 the nearest neighbors are rows 1 and 2. To replace the NaN at (0, A), we compute the distance-weighted average of the nearest neighbors' values in the same column. But what if one of those neighbors' values is also NaN?
Example:
Suppose the nearest neighbors of row 3 are rows 2 and 4. Row 3 is missing its value in column D, so we replace it with the distance-weighted average of the neighbors' values in column D:
distance average = [(1/D1) * 50.0 + (1/D2) * 30.0] / [(1/D1) + (1/D2)]
and write that average to (3, D), where D1 and D2 are the NaN-aware Euclidean distances from row 3 to rows 2 and 4. For row 0, however, the nearest neighbors are rows 1 and 2, and to replace the NaN at (0, A) we need the distance-weighted average of the column A values in rows 1 and 2. The value at (2, A) is 5.0, which is fine, but the value at (1, A) is NaN, so the computation breaks down (D3 and D4 being the distances from row 0 to rows 1 and 2):
distance average = [(1/D3) * NaN + (1/D4) * 5.0] / [(1/D3) + (1/D4)]
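To make the two cases concrete, here is a small sketch of the inverse-distance-weighted average. It assumes the NaN-aware Euclidean distance for D1 and D2 and normalises by the sum of the weights, so that the result stays between the two donor values. The helper names are illustrative, and the distances used in the NaN case are placeholders, since only the NaN propagation matters there.

```python
import math


def nan_euclidean(u, v):
    """Euclidean distance over coordinates present in both rows, rescaled."""
    pairs = [(a, b) for a, b in zip(u, v)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return math.nan
    return math.sqrt(len(u) / len(pairs) * sum((a - b) ** 2 for a, b in pairs))


def weighted_average(values, distances):
    """Inverse-distance-weighted mean; assumes no distance is zero."""
    weights = [1.0 / d for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


row2 = [5.0, 2.0, 20.0, 50.0, 75.0]
row3 = [3.0, 5.0, 7.0, math.nan, 150.0]
row4 = [2.0, 9.0, 7.0, 30.0, 90.0]

# Row 3, column D: both supposed neighbors (rows 2 and 4) have a value.
d1 = nan_euclidean(row3, row2)
d2 = nan_euclidean(row3, row4)
filled = weighted_average([50.0, 30.0], [d1, d2])
print(round(filled, 2))

# Row 0, column A: one donor value is NaN, and NaN propagates through
# the arithmetic, so the whole average becomes NaN.
broken = weighted_average([math.nan, 5.0], [1.0, 2.0])  # placeholder distances
print(math.isnan(broken))
```

Any arithmetic involving NaN yields NaN, which is why a neighbor without a value in the target column cannot simply be plugged into the formula.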
So how should the NaN at (0, A) be replaced, and how does sklearn's KNNImputer handle this kind of scenario?
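For context, the scikit-learn documentation describes KNNImputer as imputing each missing feature from the n_neighbors nearest neighbors that have a value for that feature, with `nan_euclidean` as the default metric. In other words, neighbors are chosen per column, and a row that is itself NaN in that column is never a donor. Below is a minimal from-scratch sketch of that rule. It is not sklearn's actual implementation, and `impute_cell`, `k`, and `weighted` are hypothetical names.

```python
import math

nan = math.nan
rows = [
    [nan, 2.0, 4.0, 10.0, 100.0],
    [nan, 3.0, 9.0, 12.0, nan],
    [5.0, 2.0, 20.0, 50.0, 75.0],
    [3.0, 5.0, 7.0, nan, 150.0],
    [2.0, 9.0, 7.0, 30.0, 90.0],
]


def nan_euclidean(u, v):
    """Euclidean distance over coordinates present in both rows, rescaled."""
    pairs = [(a, b) for a, b in zip(u, v)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return math.nan
    return math.sqrt(len(u) / len(pairs) * sum((a - b) ** 2 for a, b in pairs))


def impute_cell(rows, i, col, k=2, weighted=False):
    """Impute rows[i][col] from the k nearest rows that have a value in col."""
    donors = []
    for j, other in enumerate(rows):
        if j == i or math.isnan(other[col]):
            continue  # a row with NaN in this column cannot donate
        d = nan_euclidean(rows[i], other)
        if not math.isnan(d):
            donors.append((d, other[col]))
    donors.sort(key=lambda t: t[0])
    nearest = donors[:k]
    if not nearest:
        return math.nan  # no usable donor at all
    if not weighted:
        return sum(v for _, v in nearest) / len(nearest)
    exact = [v for d, v in nearest if d == 0.0]
    if exact:  # avoid dividing by a zero distance
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


print(impute_cell(rows, 0, 0))                 # (0, A), uniform weights
print(impute_cell(rows, 0, 0, weighted=True))  # (0, A), inverse-distance weights
print(impute_cell(rows, 3, 3))                 # (3, D), uniform weights
```

Under these assumptions, row 1 is skipped for (0, A) even though it is the closest row overall, and the donors become rows 4 and 2, giving 3.5 with uniform weights. Running `KNNImputer(n_neighbors=2)` from `sklearn.impute` on the same data is one way to cross-check the sketch.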