I have a pipeline that handles my data like this:

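from sklearn import tree
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Split the columns by dtype; TARGET is excluded from the categorical features
num_columns = merged_df.select_dtypes(include=['float64']).columns
cat_columns = merged_df.select_dtypes(include=['object']).drop(columns=['TARGET']).columns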

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('label', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_columns),
        ('cat', categorical_transformer, cat_columns)])

# merged_df = merged_df.fillna(method='ffill')

# Separate the features from the target (axis must be passed by keyword in recent pandas)
X = merged_df.drop(columns=['TARGET'])
y = merged_df['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', tree.DecisionTreeClassifier())])

rf.fit(X_train, y_train)

The problem is that my data has many missing values, and they are missing because those values were never meant to exist. For example, the subjects for semester 1 in 2013 could be A, B, C, but the subjects for semester 1 in 2015 could be A, C, D. So the dataframe has columns A, B, C, D and, of course, many missing values. When I try to fit this data with the pipeline, which is supposed to impute the missing values, it rejects my NaN values. I have read about a similar case here, but in my case there is no row that is empty in all columns. Because I was clueless, I called fillna on everything, which defeats the purpose: what is the pipeline for if I simply fillna everything? Please help me. I have been trying to solve this all day.

In case you want to see my data, you can view it here; it's a spreadsheet.

user15653864

1 Answer

The problem here is that you have null values inside your y vector (the TARGET column). The imputers in the pipeline only transform the feature matrix X, so null values in y are never filled.
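A quick way to confirm this is to count the missing values per column. The snippet below is a minimal sketch using a hypothetical toy frame (`toy_df`, with made-up columns `A` and `B`) in place of `merged_df`; only `TARGET` is a name taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for merged_df
toy_df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": ["x", None, "y", "x"],
    "TARGET": ["pass", "fail", None, "pass"],
})

# Missing values per column; NaN in A or B is fine because the
# pipeline imputes features, but a non-zero count for TARGET is not
null_counts = toy_df.isna().sum()
print(null_counts)

target_nulls = int(toy_df["TARGET"].isna().sum())
print(f"Rows with a missing TARGET: {target_nulls}")
```

If the count for `TARGET` is greater than zero, the error comes from y and not from the features.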

To fix that, you need to modify your y vector so that all of its values are defined, for example:

merged_df.TARGET = merged_df.TARGET.fillna('NA')

In my view, you might have chosen the wrong y variable, since this one contains null values.
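If the target is the right one and only a few of its values are missing, an alternative to filling it with a placeholder is to drop the rows where it is null before fitting. The sketch below assumes a hypothetical toy frame (`toy_df`, with made-up feature columns `A` and `B`); it is not the data from the question. It shows that NaN values left in the features are handled by the imputers once y itself is clean.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: NaN appears in the features and in TARGET
toy_df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan],
    "B": ["x", np.nan, "y", "x", "y", "x"],
    "TARGET": ["pass", "fail", np.nan, "pass", "fail", "pass"],
}).astype({"B": object, "TARGET": object})

# Drop only the rows whose target is missing; feature NaN values stay
clean_df = toy_df.dropna(subset=["TARGET"])
X = clean_df.drop(columns=["TARGET"])
y = clean_df["TARGET"]

# Column lists are written out by hand here for the toy frame
num_columns = ["A"]
cat_columns = ["B"]

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler())]), num_columns),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_columns),
])

model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# Fits without error even though X still contains NaN values
model.fit(X, y)
preds = model.predict(X)
print(len(preds), "predictions made")
```

Whether dropping or filling is appropriate depends on why the target is missing; if the nulls come from a bug in an earlier merge step, fixing that step is the better option.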

Antoine Dubuis

Oh, I see. I thought it would fill every column. Thanks for reminding me of that. I chose the right target, but it contains nulls because of a mistake in my code; the target in the original CSV does not have any nulls. – user15653864 May 26 '21 at 15:13