
I have a training set (X) and a test set (test_data_process) with the same columns in the same order, as shown below:

[screenshot of the matching column lists for X and test_data_process]

But when I do

predictions = my_model.predict(test_data_process)    

It gives the following error:

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34'] ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'YrMoSold'] expected f22, f25, f0, f34, f32, f5, f20, f3, f33, f15, f24, f31, f28, f9, f8, f19, f14, f18, f17, f2, f13, f4, f27, f16, f1, f29, f11, f26, f10, f7, f21, f30, f23, f6, f12 in input data training data did not have the following fields: OpenPorchSF, BsmtFinSF1, LotFrontage, GrLivArea, YrMoSold, FullBath, TotRmsAbvGrd, GarageCars, YearRemodAdd, BedroomAbvGr, PoolArea, KitchenAbvGr, LotArea, HalfBath, MiscVal, EnclosedPorch, BsmtUnfSF, MSSubClass, BsmtFullBath, YearBuilt, 1stFlrSF, ScreenPorch, 3SsnPorch, TotalBsmtSF, GarageYrBlt, MasVnrArea, OverallQual, Fireplaces, WoodDeckSF, 2ndFlrSF, BsmtFinSF2, BsmtHalfBath, LowQualFinSF, OverallCond, GarageArea

So it complains that the training data (X) does not have those fields, even though it does.

How can I solve this issue?

[UPDATE]:

My code:

X = data.select_dtypes(exclude=['object']).drop(columns=['Id'])
X['YrMoSold'] = X['YrSold'] * 12 + X['MoSold']
X = X.drop(columns=['YrSold', 'MoSold', 'SalePrice'])
X = X.fillna(0.0000001)

train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.2)

my_model = XGBRegressor(n_estimators=100, learning_rate=0.05, booster='gbtree')
my_model.fit(train_X, train_y, early_stopping_rounds=5, 
    eval_set=[(val_X, val_y)], verbose=False)

test_data_process = test_data.select_dtypes(exclude=['object']).drop(columns=['Id'])
test_data_process['YrMoSold'] = test_data_process['YrSold'] * 12 + test_data['MoSold']
test_data_process = test_data_process.drop(columns=['YrSold', 'MoSold'])
test_data_process = test_data_process.fillna(0.0000001)
test_data_process = test_data_process[X.columns]

predictions = my_model.predict(test_data_process)    
rcs
  • Can you show your code? I guess you may use dummy coding and the number of levels differs between the train and test datasets – Edward Sep 30 '18 at 12:52
  • I think you will find the discussion at this GitHub issue helpful: https://github.com/dmlc/xgboost/issues/2334#issuecomment-333195491 – dennissv Sep 30 '18 at 12:52
  • @Edward Added my code; please see the update. – rcs Sep 30 '18 at 13:03
  • @sds If you look at my code above, it shows the columns have the same ordering. For NA/object, I have already done `exclude=['object']` and `fillna`. For zeroes, even if I add `test_data_process[test_data_process == 0] = 0.0000001` to both `X` and `test_data_process`, it still gives the same error. – rcs Sep 30 '18 at 13:11

2 Answers


That's an honest mistake.

When fitting the model, you feed it NumPy arrays:

train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.2)

(X.values is a np.ndarray), which has no column names defined.

When calling predict, however, you pass a DataFrame, which does carry column names, so XGBoost reports a feature-name mismatch.

Use a NumPy array there as well; you can convert the DataFrame with .values:

predictions = my_model.predict(test_data_process.values)  

(add .values)
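
For illustration, here is a minimal sketch (made-up data and column names) of the two consistent choices; either works, as long as you don't mix them:

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# Made-up data purely for illustration
X = pd.DataFrame(np.random.rand(50, 3), columns=['a', 'b', 'c'])
y = np.random.rand(50)
test_df = pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])

# Option 1: NumPy arrays on both sides (the fix above)
model = XGBRegressor(n_estimators=10)
model.fit(X.values, y)
preds = model.predict(test_df.values)

# Option 2: DataFrames on both sides (XGBoost then checks feature names for you)
model2 = XGBRegressor(n_estimators=10)
model2.fit(X, y)
preds2 = model2.predict(test_df)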

epattaro
  • Wow you nailed it. Thanks a lot! – rcs Sep 30 '18 at 13:32
  • np mate. Have you tried using lightgbm? It's a better gradient-boosting implementation than xgboost. – epattaro Sep 30 '18 at 13:37
  • I've never tried it. Thanks for the info, I'll check it out. – rcs Oct 01 '18 at 00:39
  • I have a similar problem; however, adding .values doesn't resolve it. – Aditya Lahiri Aug 17 '19 at 18:51
  • Note that you can also train XGBoost on the pandas DataFrame directly if you wish to call the prediction function on a pandas DataFrame. But yes, training and prediction must be called on the same type (either NumPy array or pandas DataFrame). – Paul Bendevis Nov 04 '20 at 15:07
  • Adding to this answer: this problem occurs when there's a mismatch between `np.array` and `pd.DataFrame`, but it can also happen if, for example, you trained on an `np.array` but want to predict on a `list`. The solution is, once again, to convert your prediction argument from `list` to `np.array` (if you trained on an `np.array`). – operte Sep 15 '22 at 13:28
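
To illustrate the preceding comment, a minimal sketch (made-up data; exact behavior varies with the xgboost version) of the list-vs-array variant of the same mismatch:

import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(20, 3)
y = np.random.rand(20)
model = XGBRegressor(n_estimators=5).fit(X, y)  # trained on an np.array

row = [[0.1, 0.2, 0.3]]                 # a plain Python list of lists
# model.predict(row)                    # may trigger the same mismatch on some versions
preds = model.predict(np.array(row))    # convert to np.array, matching the training type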

I also faced the same problem and spent several hours checking many Q&As on SO and GitHub. At last, the problem is solved :). I thank this response by ianozsvald, who mentioned that we have to pass a NumPy array from the start.

In my case, when I was working with XGBoost on its own (not as a base learner in a Stacking classifier), there was no problem. However, when multiple base learners, including XGBoost, were combined in the Stacking classifier and I tried to call the KernelExplainer of SHAP (SHapley Additive exPlanations) to explain the Stacking classifier, I got the error.

Here is how I solved the problem.

  1. First, I changed train_x_df to train_x_df.values when fitting the Stacking classifier.
  2. Second, I changed train_x_df to train_x_df.values and passed it as the data argument of KernelExplainer.

In a sentence: to solve the problem, we have to use the NumPy representation of the DataFrame everywhere (obtained via the .values property). Remember that executing only the 2nd step does not work (at least in my case), as it still hits the mismatch.
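
A minimal sketch of those two steps, assuming scikit-learn's StackingClassifier and the shap package (made-up data and hypothetical column names):

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Made-up data purely for illustration
train_x_df = pd.DataFrame(np.random.rand(100, 3), columns=['a', 'b', 'c'])
train_y = np.random.randint(0, 2, 100)

stack = StackingClassifier(
    estimators=[('xgb', XGBClassifier(n_estimators=10))],
    final_estimator=LogisticRegression())

# Step 1: fit on the NumPy representation, not the DataFrame
stack.fit(train_x_df.values, train_y)

# Step 2: pass the same NumPy representation as KernelExplainer's background data
explainer = shap.KernelExplainer(stack.predict_proba, train_x_df.values[:20])
shap_values = explainer.shap_values(train_x_df.values[:5])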

Md. Sabbir Ahmed