
I have a dataset (breast cancer detection) with all numerical data, and I have divided it into X (containing all features) and y (the output class). After splitting the data into training and test sets, I run into a problem when applying feature scaling: I get ValueError: could not convert string to float: '?', even though I have replaced '?' with -9999.

X=df.iloc[:,:-1].values
y=df.iloc[:,-1].values

#Now splitting data into training and test data.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

#Replacing '?' with -9999.

df=df.replace('?',-9999)
from sklearn.preprocessing import LabelEncoder

#Applying label encoding on y.

le = LabelEncoder()
y = le.fit_transform(y)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 1:] = sc.fit_transform(X_train[:, 1:])
X_test[:, 1:] = sc.transform(X_test[:, 1:])

# After this I get the ValueError. How can I make sure that no '?' values remain in the data, or is there some categorical encoding I still need to do?
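One thing that stands out in the code above is the order of the statements: X and y are taken from df, and the train/test split is made, before df.replace('?', -9999) runs. Since replace returns a new DataFrame, the arrays extracted earlier would still hold the '?' strings. The sketch below reproduces that on a made-up toy frame (the column names and values are hypothetical, not from the real dataset) and then shows one possible order: replace, coerce to numeric, extract, split, scale. It assumes every feature column is meant to be numeric, so it skips the one-hot encoding step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the real data; names and values are made up.
df = pd.DataFrame({
    "clump_thickness": [5, 3, 8, 1, 2, 4, 7, 6],
    "bare_nuclei": ["1", "?", "10", "2", "?", "3", "4", "1"],
    "label": [2, 2, 4, 2, 2, 4, 4, 2],
})

# Extracting before the replace reproduces the problem: this array keeps the '?'.
X_stale = df.iloc[:, :-1].values
df = df.replace("?", -9999)
stale_has_qmark = any(v == "?" for v in X_stale.ravel())

# Replace first, then coerce every feature column to a number.
# errors="coerce" turns anything non-numeric into NaN so it can be detected.
features = df.iloc[:, :-1].apply(pd.to_numeric, errors="coerce")
leftover = int(features.isna().sum().sum())  # count of values that were not numeric

# Only now extract the arrays and split.
X = features.to_numpy(dtype=float)
y = df.iloc[:, -1].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training set only, then reuse it on the test set.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

Here stale_has_qmark shows whether the early copy still contains '?', and leftover counts any values that could not be read as numbers. Note that a sentinel such as -9999 is included in the mean and variance that StandardScaler computes, so it may be worth treating those entries as missing values instead.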

mtr_007
