I have a pipeline that preprocesses my data like this:
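from sklearn import tree
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric features are the float64 columns; categorical features are the object columns except TARGET
num_columns = merged_df.select_dtypes(include=['float64']).columns
cat_columns = merged_df.select_dtypes(include=['object']).drop(columns=['TARGET']).columns

# Numeric columns: fill NaN with the column mean, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Categorical columns: fill NaN with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('label', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_columns),
        ('cat', categorical_transformer, cat_columns)])

# merged_df = merged_df.fillna(method='ffill')
X = merged_df.drop(columns=['TARGET'])
y = merged_df['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', tree.DecisionTreeClassifier())])
rf.fit(X_train, y_train)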
The problem is that my data has many missing values, and they are missing because those values are not meant to exist. For example, the subjects for semester 1 in 2013 could be A, B, C, but the subjects for semester 1 in 2015 could be A, C, D. So the dataframe has columns A, B, C, D and, of course, many missing values. When I try to fit this data with the pipeline, which is supposed to IMPUTE the missing values, it rejects my NaN. I have read about a similar case here, but in my case there is no row that is empty in ALL columns. Because I was so clueless, I called fillna on the data, which is dumb: what is the pipeline for if I simply fillna everything? Please help me. I have been trying to solve this all day long.
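To make the situation concrete, here is a minimal sketch of the same kind of data, using made-up toy values rather than the real spreadsheet. The column GENDER, the names toy_df and model, and every number are hypothetical and only there for illustration. It runs the same preprocessing steps on a frame where B exists only for the earlier rows and D only for the later rows, and it also counts NaN in the target, which is one thing that may be worth checking because the imputers in the pipeline only touch the feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): subject B was only offered in 2013, subject D only in 2015.
toy_df = pd.DataFrame({
    "A": [70.0, 65.0, 80.0, 55.0],
    "B": [60.0, 75.0, np.nan, np.nan],
    "C": [50.0, 85.0, 90.0, 45.0],
    "D": [np.nan, np.nan, 88.0, 62.0],
    # Hypothetical categorical column with one missing value.
    "GENDER": pd.Series(["F", "M", np.nan, "F"], dtype=object),
    "TARGET": pd.Series(["pass", "pass", "fail", "fail"], dtype=object),
})

X = toy_df.drop(columns=["TARGET"])
y = toy_df["TARGET"]

num_columns = X.select_dtypes(include=["float64"]).columns
cat_columns = X.select_dtypes(include=["object"]).columns

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler())]), num_columns),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_columns),
])

model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
model.fit(X, y)

# Inspect what the classifier actually receives after preprocessing.
Xt = model.named_steps["preprocessor"].transform(X)
Xt = Xt.toarray() if hasattr(Xt, "toarray") else Xt
print("NaN left in features after preprocessing:", bool(np.isnan(Xt).any()))

# The imputers never see y, so NaN in the target is not filled by the pipeline.
print("NaN in target:", int(y.isna().sum()))
```

Running the same two checks on merged_df (the NaN count in TARGET, and the transformed features) might help narrow down where the rejected NaN comes from.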
In case you want to see my data, you can view it here; it's a spreadsheet.