I am building a GBM to predict a very low-likelihood event. My model performs no better with my features than with random numbers (i.e. badly), so I am trying to use SMOTE to counter the class imbalance in my outcome (98.55% 0, 1.45% 1).

The solutions here seem to imply that my issue comes from the input type not being an array, but my code suggests that it is one.

My data looks as follows:

X = num_df.drop(columns=[u'Has Claim'])
y = num_df[u'Has Claim']

X
   Underwriting Year  Public Liability Limit  Employers Liability Limit  \
0               2014                 1000000                          0   
1               2014                 5000000                          0   
2               2014                 5000000                   10000000   
3               2014                 2000000                          0   
4               2014                 1000000                          0   

   Tools Sum Insured  Professional Indemnity Limit  \
0                0.0                         50000   
1                0.0                             0   
2             4000.0                             0   
3             2000.0                             0   
4                0.0                       1000000   

   Contract Works Sum Insured  Hired in Plan Sum Insured  Manual EE  \
0                           0                          0          1   
1                           0                          0          1   
2                           0                          0          1   
3                           0                          0          6   
4                           0                          0          1   

   Clerical EE  Subcontractor EE  rand_1  rand_2  rand_3  rand_4  rand_5  \
0            0                 0       1       2       2       1       5   
1            0                 0       4       3       1       2       2   
2            7                 0       2       2       4       1       5   
3            4                 0       5       4       1       2       2   
4            0                 0       4       3       4       5       2   

   rand_6  rand_7  rand_8  rand_9  rand_10  
0       2       3       5       1        1  
1       4       3       1       1        5  
2       2       5       3       1        5  
3       1       5       1       3        2  
4       5       2       5       4        3  

y
0    0
1    0
2    0
3    0
4    0
Name: Has Claim, dtype: int64

I do a train/test split:

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2, 
                                                    random_state=42)

When I fit my model, it works:

model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
   colsample_bytree=0.5, gamma=0, learning_rate=0.1, max_delta_step=0,
   max_depth=5, min_child_weight=1, missing=None, n_estimators=1000,
   n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
   reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
   subsample=0.8)

However, if I use

smt = SMOTE()
X_train, y_train = smt.fit_sample(X_train,
                                  y_train)

then refit my model and use

y_pred = model.predict(X_test)

Then I get:

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19'] [u'Underwriting Year', u'Public Liability Limit', u'Employers Liability Limit', u'Tools Sum Insured', u'Professional Indemnity Limit', u'Contract Works Sum Insured', u'Hired in Plan Sum Insured', u'Manual EE', u'Clerical EE', u'Subcontractor EE', u'rand_1', u'rand_2', u'rand_3', u'rand_4', u'rand_5', u'rand_6', u'rand_7', u'rand_8', u'rand_9', u'rand_10']
expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f18, f19, f12, f13, f10, f11, f16, f17, f14, f15 in input data
training data did not have the following fields: rand_6, rand_7, rand_4, rand_5, rand_2, rand_3, rand_1, Public Liability Limit, Subcontractor EE, Professional Indemnity Limit, rand_8, rand_9, Manual EE, Employers Liability Limit, rand_10, Contract Works Sum Insured, Underwriting Year, Tools Sum Insured, Clerical EE, Hired in Plan Sum Insured

I expect to be able to make predictions using my updated model.

Am I misunderstanding how SMOTE works? Am I not applying it correctly?
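
One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: the model was refit on input with no column names (hence the generic names f0 to f19 in the message), while X_test is still a DataFrame with the original names. That would be the case if the resampler returned plain NumPy arrays, which is an assumption about the library version in use. The snippet below uses only NumPy and pandas; `X_res` is a hypothetical stand-in for the resampled training data, and the three columns are a shortened example of the real frame.

```python
import numpy as np
import pandas as pd

# Small stand-in for the training frame (three of the original columns).
X_train = pd.DataFrame({
    "Underwriting Year": [2014, 2014, 2014],
    "Manual EE": [1, 1, 6],
    "rand_1": [1, 4, 5],
})

# Hypothetical stand-in for the resampler output: a plain NumPy array
# with one extra synthetic row and no column names (assumption).
X_res = np.vstack([X_train.to_numpy(), [[2014, 3, 2]]])

# Option 1: put the original column names back before refitting,
# so training and test inputs carry the same feature names.
X_res_df = pd.DataFrame(X_res, columns=X_train.columns)

# Option 2: drop the names on the test side instead, so both inputs
# are unnamed arrays (the column order must still match).
X_test = X_train.iloc[:2]
X_test_arr = X_test.to_numpy()

names_match = list(X_res_df.columns) == list(X_test.columns)
shapes_match = X_res.shape[1] == X_test_arr.shape[1]
print(names_match, shapes_match)
```

With either option, the model would be refit on the resampled training data and asked to predict on the matching form of X_test (named frame with named frame, or array with array).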

Violatic
  • Believe you're getting the same problem as [described here](https://stackoverflow.com/a/52578211/3220769) – TomNash May 02 '19 at 16:43
  • I had the same issue, [this solution](https://stackoverflow.com/questions/50711382/why-am-i-getting-a-valueerror-feature-names-mismatch-when-specifying-the-feat) helped me – Roy Ambar Jun 11 '19 at 20:22