
The `Num` column has too many unique values, and we can't do simple imputation: each unique value has such a low count that setting all 72 nulls to any one value would skew the results. So we instead randomly impute the nulls using the existing values:

import numpy as np
import pandas as pd

num_values = X_train['Num'].dropna().values
# Make a copy of the non-missing values and shuffle it
num_shuffled = num_values.copy()
np.random.shuffle(num_shuffled)
num_shuffled = pd.Series(num_shuffled)
# Fill the missing values with the shuffled values
X_train['Num'] = X_train['Num'].fillna(num_shuffled)

The number of nulls decreased but did not go to zero, even though there are a lot more values in num_shuffled than there are nulls.
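For reference, a minimal sketch of the same pattern on made-up data (the real dataset isn't shown here) reproduces the behaviour:

import numpy as np
import pandas as pd

# Made-up stand-in for X_train: 10 rows, 4 of them missing 'Num'
X_train = pd.DataFrame(
    {'Num': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0, np.nan, 8.0, np.nan, 10.0]}
)

num_values = X_train['Num'].dropna().values   # 6 observed values
num_shuffled = num_values.copy()
np.random.shuffle(num_shuffled)
num_shuffled = pd.Series(num_shuffled)        # gets a fresh index 0..5

X_train['Num'] = X_train['Num'].fillna(num_shuffled)
print(X_train['Num'].isna().sum())            # prints 2, not 0: rows 6 and 8 stay NaN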

  • "The number of nulls decreased but did not go to zero even though there is a lot more values in num_shuffled then the number of nulls" - in your own words, **why should that be sufficient**? When `fillna` takes a value from `num_shuffled`, how do you think it decides which one to take? How long do you think `num_shuffled` should need to be, in order to solve the problem, and why do you think this is so? – Karl Knechtel Aug 26 '23 at 09:24
  • (Hint: did you try to check how many values are in `num_shuffled`? Do you see any pattern, in which values are still missing in `X_train['Num']` afterwards? Do you see how that pattern is related to the length of `num_shuffled`? Did you try [reading the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) in order to understand how it works?) – Karl Knechtel Aug 26 '23 at 09:26
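Following the hints in the comments: when `fillna` is given a `Series`, it matches values to missing rows by index label, not by position, so only the nulls whose labels happen to fall inside `num_shuffled`'s fresh 0..N-1 index get filled; the length of `num_shuffled` by itself doesn't matter. Below is a minimal sketch of one possible workaround, building the fill values with an index that targets exactly the missing rows (`np.random.permutation` here stands in for the copy-and-shuffle step and is not the original code):

# Labels of the rows where 'Num' is still missing
missing_idx = X_train.index[X_train['Num'].isna()]
num_values = X_train['Num'].dropna().values

# Index the fill values by those labels, so fillna's label alignment
# hits exactly the missing rows
fill_values = pd.Series(
    np.random.permutation(num_values)[:len(missing_idx)],
    index=missing_idx,
)

X_train['Num'] = X_train['Num'].fillna(fill_values)
print(X_train['Num'].isna().sum())   # 0

Another option along the same lines is to reassign `num_shuffled`'s index to the missing labels before calling `fillna`; either way, the key point is that the filler Series' index has to match the index labels of the rows being filled.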
