Data Cleaning Error in Classification KNN Algorithm Problem

Question

I believe the error is telling me that I have null values in my data. I've tried fixing it, but the error keeps appearing. I don't want to delete the rows with null values because I consider them relevant to my analysis. The columns of my data are in this order: 'Titulo', 'Autor', 'Género', 'Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas', **'Estado'**. The ones in bold are string data.

Code:

import numpy as np
#Load Data
import pandas as pd
dataset = pd.read_excel(r"C:\Users\renat\Documents\Data Science Projects\Classification\Book Purchases\Biblioteca.xlsx")
#print(dataset.columns)

#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

#Handling missing values
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')

#Convert X and y to NumPy arrays
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,8].values
print(X.shape, y.shape)

# Create a LabelEncoder instance
labelEncoderTitulo = LabelEncoder()
X[:, 0] = labelEncoderTitulo.fit_transform(X[:, 0])

labelEncoderAutor = LabelEncoder()
X[:, 1] = labelEncoderAutor.fit_transform(X[:, 1])

labelEncoderGenero = LabelEncoder()
X[:, 2] = labelEncoderGenero.fit_transform(X[:, 2])

labelEncoderEstado = LabelEncoder()
X[:, -1] = labelEncoderEstado.fit_transform(X[:, -1])

#Instantiate our KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)

y_pred = knn.predict(X)

print(y_pred)

Error Message: ValueError: Input X contains NaN. KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Yes, you are correct. The error tells you the classifier does not handle NaN values. You say you handle them, however, you are creating an Imputer object and storing it in a variable but you don't use it. — DataJanitor, Feb 14 '23 at 13:45
`LabelEncoder` should not be used for encoding features: [sklearn.preprocessing.LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) — DataJanitor, Feb 14 '23 at 13:47

DataJanitor · Answer 1 · 2023-02-24T12:16:07.727

You have to fit and transform the data with the SimpleImputer you created. From the documentation:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # Here the imputer is created
imputer.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])  # Here the imputer is fitted, i.e. learns the mean

X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imputer.transform(X))  # Here the imputer is applied, i.e. filling the mean

The crucial parts here are imputer.fit() and imputer.transform(X). Your code calls neither, so X still contains NaN when it reaches knn.fit().
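As a minimal sketch of the same two steps on tabular data, the snippet below fills a small DataFrame whose column names are borrowed from the question. The values are made up for illustration, and it assumes only numeric columns are passed to the imputer, since the mean strategy needs numeric data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the numeric columns of the question's spreadsheet
df = pd.DataFrame({
    "Puntaje": [4.0, np.nan, 5.0],
    "Precio": [10.0, 20.0, np.nan],
    "Paginas": [300.0, 150.0, 450.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns each column's mean and fills the NaN values in one step;
# by default it returns a NumPy array, not a DataFrame
filled = imputer.fit_transform(df)
print(filled)
```

The important detail is that the result of fit_transform (or transform) has to be assigned and then passed on to the classifier; creating the imputer alone changes nothing.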

Additionally, I'd use a different technique to encode the categorical features, since LabelEncoder is not suitable here. Its documentation says:

This transformer should be used to encode target values, i.e. y, and not the input X.

For alternatives see here: How to consider categorical variables in distance based algorithms like KNN or SVM?
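One common alternative is one-hot encoding inside a pipeline. The sketch below is only an illustration under stated assumptions: the rows and the 'Estado' labels are invented, only a subset of the question's columns is used, and the choice of most-frequent imputation for the categorical column and scaling for the numeric ones is mine, not something the question prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows; the real data would come from the question's Excel file
df = pd.DataFrame({
    "Genero": pd.Series(
        ["Novela", np.nan, "Ensayo", "Novela", "Ensayo", "Novela"], dtype=object
    ),
    "Puntaje": [4.0, 3.0, np.nan, 5.0, 2.0, 4.0],
    "Paginas": [300.0, np.nan, 150.0, 420.0, 180.0, 350.0],
    "Estado": ["Leido", "Pendiente", "Pendiente", "Leido", "Pendiente", "Leido"],
})
X = df.drop(columns="Estado")
y = df["Estado"]

# Numeric columns: fill NaN with the column mean, then scale so that no single
# column dominates the distance computation
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill NaN with the most frequent value, then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["Puntaje", "Paginas"]),
    ("cat", categorical, ["Genero"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

model.fit(X, y)
y_pred = model.predict(X)
print(y_pred)
```

Keeping the imputers and encoders inside the pipeline means the same preprocessing is applied automatically at prediction time, and no rows with missing values have to be dropped.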

score 0 · Answer 2 · answered Feb 17 '23 at 19:59

You need SimpleImputer to impute the missing values in X. We fit the imputer on X and then transform X to replace the NaN values with the mean of each column. After imputing the missing values, we encode the target variable using LabelEncoder.

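import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Impute missing values in X (X and y come from the question's code)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)

# Encode target variable
labelEncoderEstado = LabelEncoder()
y = labelEncoderEstado.fit_transform(y)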
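To see these steps end to end, here is a small self-contained sketch. It assumes X contains only numeric features, because the mean strategy does not apply to string columns, and it uses invented values and labels in place of the real spreadsheet. It also shows how the encoded predictions can be mapped back to the original labels.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical numeric features and string labels standing in for the real data
X = np.array([[4.0, 300.0], [np.nan, 150.0], [5.0, np.nan], [2.0, 180.0]])
y = np.array(["Leido", "Pendiente", "Leido", "Pendiente"])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(X)

# Encode the target labels as integers
labelEncoderEstado = LabelEncoder()
y_encoded = labelEncoderEstado.fit_transform(y)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y_encoded)

# Map the integer predictions back to the original string labels
y_pred = labelEncoderEstado.inverse_transform(knn.predict(X))
print(y_pred)
```

Encoding y is optional for KNeighborsClassifier, which also accepts string labels directly; the encoder is kept here only to mirror the answer above.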
