
I am working on a water quality dataset. The dataset has the following numerical columns: Water Temperature, Turbidity, Wave Height, Wave Period, Battery Life, Transducer known.

Before imputing, Water Temperature has 63 missing values. Wave Height has 878 missing values. Wave Period is missing 878 values in the same rows as Wave Height.

I am trying to impute the missing values with KNNImputer. After imputing, I try to replace the old columns (containing NaN values) with the new columns (which should not contain NaN values), but I still get 6 NaN values in the Wave Height and Wave Period columns, and all 63 missing values remain in the Water Temperature column.

The two dataframes have the same shape, and the dataframe with the imputed values contains no missing values.
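As a side note on the shape check: two dataframes can have identical shapes and still carry different row labels, and pandas aligns column assignments on those labels rather than on position. A minimal sketch on made-up toy frames (`left` and `right` are hypothetical names, not part of the dataset) of how the two checks can disagree:

```python
import pandas as pd

# Toy frames (hypothetical data): same shape, different row labels.
left = pd.DataFrame({"Wave Height": [0.1, 0.2, 0.3]}, index=[0, 1, 5])
right = pd.DataFrame({"Wave Height": [0.1, 0.2, 0.3]})  # default RangeIndex 0, 1, 2

same_shape = left.shape == right.shape
same_index = left.index.equals(right.index)

# Row labels that exist in `left` but have no counterpart in `right`.
unmatched = left.index.difference(right.index)

print(same_shape, same_index, list(unmatched))
```

Running the same two checks on the real dataframes would show whether the row labels match, which the shape alone cannot tell.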

Below is the code. What am I doing wrong? Thanks for your help!

import pandas as pd
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=30)
# Drop the non-numerical columns
df_int = df.drop(columns=["Beach Name", "Transducer Depth", "Measurement Timestamp 24h"])
df_int.shape

Output: (40705, 6)

imputed_data = imputer.fit_transform(df_int)
imputed_data.shape

Output: (40705, 6)

original_df.shape

Output: (40705, 9) -- 3 more columns than df_int, because I dropped 3 columns

df_temp = pd.DataFrame(imputed_data, columns=df_int.columns)  # column names restored so the lookups below work
df_temp.isna().sum()

Output:
Water Temperature    0
Turbidity            0
Wave Height          0
Wave Period          0
Battery Life         0
Transducer known     0
dtype: int64

df_int["Wave Height"] = df_temp["Wave Height"]
df_int["Wave Period"] = df_temp["Wave Period"]
df_int.shape

Output: (40705, 6)

df_int[df_int["Wave Height"].isna()]

Output:

| Index | WTemp | Turb | WH | WL | BtLife | TD |
|-------|-------|------|----|----|--------|----|
| 40718 | 24.90 | 0.80 | NaN | NaN | 11.00 | 0 |
| 40719 | 18.60 | 0.37 | NaN | NaN | 11.60 | 0 |
| 40720 | 14.10 | 0.00 | NaN | NaN | 10.40 | 0 |
| 40759 | 21.90 | 0.01 | NaN | NaN | 9.40 | 0 |
| 40780 | 18.90 | 29.55 | NaN | NaN | 5.50 | 0 |
| 40781 | 21.70 | 3.15 | NaN | NaN | 9.40 | 0 |

df_int.isna().sum()

Output:
Water Temperature    63
Turbidity             0
Wave Height           6
Wave Period           6
Battery Life          0
Transducer known      0

I have tried narrowing down when the problem occurs, thinking maybe the shapes of the dataframes were different. This wasn't the case.
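Two things stand out in the code as posted. First, only Wave Height and Wave Period are copied back, so Water Temperature would keep its 63 NaN values regardless. Second, the rows that still show NaN have index labels (40718 and up) larger than the row count of 40705, which suggests the index of df_int has gaps while the frame built from the imputed array starts over at 0. The sketch below reproduces that pattern on toy data; the values, the `df_toy` name and the small `n_neighbors` are assumptions for illustration, and it has not been run against the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values); the index has gaps, as it would after rows were dropped.
df_toy = pd.DataFrame(
    {
        "Water Temperature": [20.0, np.nan, 22.0, 21.0],
        "Wave Height": [0.1, 0.2, np.nan, 0.4],
    },
    index=[0, 1, 7, 9],
)

imputed = KNNImputer(n_neighbors=2).fit_transform(df_toy)

# Without index=..., the new frame gets a fresh RangeIndex (0, 1, 2, 3).
df_bad = pd.DataFrame(imputed, columns=df_toy.columns)
misaligned = df_toy.copy()
# Assignment aligns on row labels: labels 7 and 9 have no match in df_bad.
misaligned["Wave Height"] = df_bad["Wave Height"]

# Carrying over the original index keeps every row aligned.
df_good = pd.DataFrame(imputed, columns=df_toy.columns, index=df_toy.index)
fixed = df_toy.copy()
for col in df_toy.columns:  # copy back every imputed column, not just some
    fixed[col] = df_good[col]

print(misaligned["Wave Height"].isna().sum())  # NaNs left by the label mismatch
print(fixed.isna().sum().sum())                # NaNs left after aligned assignment
```

If this is what is happening, passing `index=df_int.index` when building df_temp (or calling `reset_index(drop=True)` on df_int beforehand) and copying back all imputed columns would be the corresponding change in the original code.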

mrgoat