I am building a GBM to predict a very low-likelihood event. My model performs no better with my features than with random numbers (i.e. badly), so I am trying to use SMOTE to counter the class imbalance in my outcomes (98.55% 0, 1.45% 1).
The solutions here seem to imply that my issue comes from the input type not being an array, but my code suggests that it is one.
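To put the imbalance in concrete terms, here is a small sketch of the baseline any model has to beat. It assumes a hypothetical sample of 10,000 rows (the row count is made up for illustration; only the 98.55% / 1.45% split comes from my data) and shows that always predicting 0 already scores very high accuracy while catching no claims at all.

```python
# Hypothetical sample size; only the 98.55% / 1.45% split comes from my data.
n_rows = 10000
n_pos = int(round(n_rows * 0.0145))  # rows where Has Claim == 1
n_neg = n_rows - n_pos               # rows where Has Claim == 0

# True labels, and a trivial "model" that always predicts the majority class.
y_true = [1] * n_pos + [0] * n_neg
y_pred = [0] * n_rows

# Accuracy: share of rows where the prediction equals the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / float(n_rows)

# Recall on class 1: share of actual claims that were predicted as claims.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / float(n_pos)

print("accuracy = {:.4f}, recall on class 1 = {:.4f}".format(accuracy, recall))
```

This is why I am not judging the model on accuracy alone and am looking at resampling instead.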
My data looks as follows:
X = num_df.drop(columns=[u'Has Claim'])
y = num_df[u'Has Claim']
X
   Underwriting Year  Public Liability Limit  Employers Liability Limit  \
0               2014                 1000000                          0
1               2014                 5000000                          0
2               2014                 5000000                   10000000
3               2014                 2000000                          0
4               2014                 1000000                          0

   Tools Sum Insured  Professional Indemnity Limit  \
0                0.0                         50000
1                0.0                             0
2             4000.0                             0
3             2000.0                             0
4                0.0                       1000000

   Contract Works Sum Insured  Hired in Plan Sum Insured  Manual EE  \
0                           0                          0          1
1                           0                          0          1
2                           0                          0          1
3                           0                          0          6
4                           0                          0          1

   Clerical EE  Subcontractor EE  rand_1  rand_2  rand_3  rand_4  rand_5  \
0            0                 0       1       2       2       1       5
1            0                 0       4       3       1       2       2
2            7                 0       2       2       4       1       5
3            4                 0       5       4       1       2       2
4            0                 0       4       3       4       5       2

   rand_6  rand_7  rand_8  rand_9  rand_10
0       2       3       5       1        1
1       4       3       1       1        5
2       2       5       3       1        5
3       1       5       1       3        2
4       5       2       5       4        3
y
0    0
1    0
2    0
3    0
4    0
Name: Has Claim, dtype: int64
I do a train/test split:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
When I fit my model, it works:
model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.5, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
       subsample=0.8)
However, if I use
from imblearn.over_sampling import SMOTE

smt = SMOTE()
X_train, y_train = smt.fit_sample(X_train, y_train)
then refit my model and use
y_pred = model.predict(X_test)
then I get:
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19'] [u'Underwriting Year', u'Public Liability Limit', u'Employers Liability Limit', u'Tools Sum Insured', u'Professional Indemnity Limit', u'Contract Works Sum Insured', u'Hired in Plan Sum Insured', u'Manual EE', u'Clerical EE', u'Subcontractor EE', u'rand_1', u'rand_2', u'rand_3', u'rand_4', u'rand_5', u'rand_6', u'rand_7', u'rand_8', u'rand_9', u'rand_10']
expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f18, f19, f12, f13, f10, f11, f16, f17, f14, f15 in input data
training data did not have the following fields: rand_6, rand_7, rand_4, rand_5, rand_2, rand_3, rand_1, Public Liability Limit, Subcontractor EE, Professional Indemnity Limit, rand_8, rand_9, Manual EE, Employers Liability Limit, rand_10, Contract Works Sum Insured, Underwriting Year, Tools Sum Insured, Clerical EE, Hired in Plan Sum Insured
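The f0 ... f19 names in the error look like positional default names, which suggests the model was refit on data that no longer carried my column labels while X_test still has them. Below is a pure-Python sketch of that mismatch. It assumes (I have not confirmed this) that the resampler hands back a plain array without column names; default_names and check_names are hypothetical helpers written for illustration and are not part of xgboost or imblearn.

```python
# Column labels of the original DataFrame, copied from the error message.
columns = [
    "Underwriting Year", "Public Liability Limit", "Employers Liability Limit",
    "Tools Sum Insured", "Professional Indemnity Limit",
    "Contract Works Sum Insured", "Hired in Plan Sum Insured",
    "Manual EE", "Clerical EE", "Subcontractor EE",
] + ["rand_{}".format(i) for i in range(1, 11)]


def default_names(n_features):
    """Hypothetical helper: positional names for unlabeled data (f0, f1, ...)."""
    return ["f{}".format(i) for i in range(n_features)]


def check_names(train_names, test_names):
    """Hypothetical helper: raise if train and test feature names differ."""
    if train_names != test_names:
        raise ValueError(
            "feature_names mismatch: {} {}".format(train_names, test_names))


# Assumption: after resampling, the training data is an unlabeled array,
# so the refit model only knows positional names.
train_names = default_names(len(columns))
test_names = columns  # X_test is still a labeled DataFrame

try:
    check_names(train_names, test_names)
    mismatch = False
except ValueError:
    mismatch = True

# Two ways to make both sides agree: drop the labels from the test data,
# or re-attach the original labels to the resampled training data.
check_names(train_names, default_names(len(test_names)))
check_names(columns, test_names)
print("mismatch reproduced:", mismatch)
```

If this reading is right, the equivalent in my real code would presumably be either predicting on X_test.values, or wrapping the resampled array in pd.DataFrame(X_train, columns=X.columns) before refitting. I have not verified either option.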
I expected to be able to make predictions using my refitted model.
Am I misunderstanding how SMOTE works? Am I not applying it correctly?