I want to use XGBRegressor to predict house prices (SalePrice), so first I load the training data and the test data.

import pandas as pd

iowa_file_path = '../input/train.csv'
test_data_path = '../input/test.csv'

data = pd.read_csv(iowa_file_path)
test_data = pd.read_csv(test_data_path)

Contents of data (screenshot of the DataFrame omitted)

Contents of test_data (screenshot of the DataFrame omitted)

Then I do some data cleaning and train the model:

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor

# Drop rows without a target and keep only the numeric feature columns
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])

# Split into training and validation sets, then impute missing values
train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.25)
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
val_X = my_imputer.transform(val_X)

my_model = XGBRegressor(n_estimators=100, learning_rate=0.1)
my_model.fit(train_X, train_y, early_stopping_rounds=None,
    eval_set=[(val_X, val_y)], verbose=False)

# Predict on the numeric columns of the test data
test_data_process = test_data.select_dtypes(exclude=['object'])
predictions = my_model.predict(test_data_process)

But I get the following error message when running the predict function:


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      1 test_data_process = test_data.select_dtypes(exclude=['object'])
----> 2 predictions = my_model.predict(test_data_process)

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/sklearn.py in predict(self, data, output_margin, ntree_limit, validate_features)
    395                 output_margin=output_margin,
    396                 ntree_limit=ntree_limit,
--> 397                 validate_features=validate_features)
    398 
    399     def apply(self, X, ntree_limit=0):

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
   1206 
   1207         if validate_features:
-> 1208             self._validate_features(data)
   1209 
   1210         length = c_bst_ulong()

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/core.py in _validate_features(self, data)
   1508 
   1509             raise ValueError(msg.format(self.feature_names,
-> 1510                                         data.feature_names))
   1511 
   1512     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36'] ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
expected f9, f6, f14, f27, f18, f7, f8, f23, f17, f22, f35, f0, f28, f29, f20, f31, f36, f25, f11, f21, f12, f24, f34, f10, f5, f32, f15, f26, f30, f1, f2, f16, f19, f3, f4, f33, f13 in input data
training data did not have the following fields: BsmtUnfSF, 1stFlrSF, LowQualFinSF, MSSubClass, WoodDeckSF, GrLivArea, MiscVal, YearBuilt, BsmtFinSF1, Fireplaces, MoSold, BsmtHalfBath, GarageYrBlt, FullBath, PoolArea, YrSold, HalfBath, 2ndFlrSF, KitchenAbvGr, OverallQual, Id, EnclosedPorch, ScreenPorch, GarageArea, BsmtFullBath, MasVnrArea, TotRmsAbvGrd, OverallCond, BedroomAbvGr, GarageCars, OpenPorchSF, YearRemodAdd, TotalBsmtSF, BsmtFinSF2, LotFrontage, 3SsnPorch, LotArea

It complains that the feature names mismatch and that the training data did not have those fields. But when I check the contents of data, it does have those columns. How can I resolve this?

    You haven't used SimpleImputer on the test data. Is there any data missing there? You can also have a look at https://github.com/dmlc/xgboost/issues/2334 – KPLauritzen Sep 19 '18 at 05:46
  • Yes, you are right. I've just run the SimpleImputer and now it works. Thanks. – rcs Sep 19 '18 at 05:53

1 Answer

Just to close the question:

The problem is that SimpleImputer was used on the training and validation data, but not on the test data.

A discussion of what can cause this kind of error can be found here: https://github.com/dmlc/xgboost/issues/2334#issuecomment-333195491
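As a minimal sketch of the fix, reusing the variable names from the question (and assuming the fitted my_imputer and my_model are still in scope), the test data just needs to go through the same imputation step before predict is called:

# Keep the same numeric columns that were used for training
test_data_process = test_data.select_dtypes(exclude=['object'])

# Reuse the imputer fitted on the training data; this fills the missing
# values and returns a NumPy array, the same format the model was fit on
test_X = my_imputer.transform(test_data_process)

predictions = my_model.predict(test_X)

Because the model was fitted on NumPy arrays, it only knows the generic feature names f0…f36; handing predict the imputer's NumPy output instead of the raw DataFrame therefore also lets the feature_names check pass.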

KPLauritzen