I am using a dataset that has null values and a mix of categorical and continuous features. I first replaced the null values in certain columns and then used SMOTENC in a pipeline with StratifiedKFold, but the accuracy and ROC scores are always nan. Can anyone please shed some light on this?
Here is the code snippet:
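from pandas import read_csv
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTENC
# imblearn's Pipeline is needed so that the SMOTENC sampler step is accepted
from imblearn.pipeline import Pipeline

# '/' marks missing values in the file
df = read_csv(filename, header=0, na_values='/')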
# fill missing values: mode for the categorical column, mean for the continuous ones
# (assigning back avoids relying on inplace=True on a column selection)
df['serogroup'] = df['serogroup'].fillna(df['serogroup'].mode()[0])
for col in ['HDL', 'LDL', 'HCV-RNATaqman', 'HCV-RNAquantity']:
    df[col] = df[col].fillna(df[col].mean())
data = df.values
X, y = data[:, :-1], data[:, -1]
y=y.astype('int')
pipeline = Pipeline(steps=[
    ('smote', SMOTENC(categorical_features=[1, 2, 7], random_state=0)),
    ('scalar', StandardScaler()),
    ('classifier', RandomForestClassifier()),
])
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=None)
scores = model_selection.cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print("Score", scores.mean())
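
A note on debugging this: `cross_val_score` takes an `error_score` parameter whose default is nan, so an exception raised inside a fold can end up as a nan score instead of a traceback (the exact behavior depends on the scikit-learn version). Passing `error_score='raise'` should surface the underlying exception. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than my dataset, swaps in `LogisticRegression` because it is known to reject NaN input, and the helper `first_cv_error` and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the dataset from the question).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = np.array([0, 1] * 30)
X_demo[::5, 2] = np.nan  # leave missing values in one column on purpose

# StandardScaler passes NaN through; LogisticRegression then rejects it.
demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
demo_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)


def first_cv_error(estimator, X, y, cv):
    """Return the first exception raised during cross-validation as text, or None."""
    try:
        cross_val_score(estimator, X, y, scoring="accuracy", cv=cv,
                        error_score="raise")
    except Exception as exc:  # broad on purpose: we only want to see the cause
        return f"{type(exc).__name__}: {exc}"
    return None


message = first_cv_error(demo_pipeline, X_demo, y_demo, demo_cv)
print(message)
```

For the snippet above, the equivalent would be adding `error_score='raise'` to the existing `cross_val_score` call. If the message mentions NaN, some column that was not imputed probably still contains missing values.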