Python xgb: ValueError: "feature_names mismatch"

Question

I'm trying to learn the basics of XGBoost and devised a script that splits some data I found on Kaggle about coronavirus outbreaks in China. The code and model work, but for some reason, when I use the model to make a new prediction, I get a "ValueError: feature_names mismatch." The new input is a 2-D array with two values, just like the test data, but I still get a ValueError.

    train = df[['RegionCode','ProvinceCode']].astype(int)
    test = df['infected'].astype(int)

    X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)

    train = xgb.DMatrix(X_train, label=y_train)
    test = xgb.DMatrix(X_test, label=y_test)

    param = {
        'max_depth': 4,
        'eta': 0.3,
        'num_class': 2}
    epochs = 10

    model = xgb.train(param, train, epochs)

All the code above works, but the test below gives me the error:

    testArray = np.array([[13, 67]])

    test_individual = xgb.DMatrix(testArray)

    print(model.predict(test_individual))

Any idea what I'm doing wrong?

You are not splitting the data properly, please go through my [answer](https://stackoverflow.com/questions/60636444/what-is-the-difference-between-x-test-x-train-y-test-y-train-in-sklearn/60637924?noredirect=1#comment107322867_60637924) on another post for clarity. — ManojK, Mar 13 '20 at 17:28

score 0 · Answer 1 · answered Mar 15 '20 at 05:19

Seems like you are missing out on the basics of using the train_test_split function of sklearn.

    X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)

In the line above, train_test_split expects train to hold all the features used for training and test to hold the target.

Try fixing that first.
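
For reference, scikit-learn's train_test_split returns the train portion first and the test portion second for each array passed in, so the unpacking in the quoted line assigns the larger (training) split to X_test. Below is a minimal sketch of the usual unpacking order. The rows are made-up placeholders, not the Kaggle data; only the column names come from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up placeholder rows (assumption); only the column names come from the question.
df = pd.DataFrame({
    "RegionCode": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "ProvinceCode": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    "infected": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = df[["RegionCode", "ProvinceCode"]].astype(int)  # features
y = df["infected"].astype(int)                      # target

# Return order is train first, then test, for each array passed in.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Naming the inputs X and y, rather than train and test, also avoids reusing those names later for the DMatrix objects.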

Vatsal Gupta

But isn't that what I'm doing? I have train = df[['RegionCode','ProvinceCode']].astype(int) and test = df['infected'].astype(int). train is my features and test is target. – Tyrone_Slothrop Mar 15 '20 at 15:41

Oh! I just had a look again. You need to provide the column names in the testArray that you are taking. That would solve it. – Vatsal Gupta Mar 15 '20 at 16:24
Thank you! I don't quite understand what you mean by providing the column name? I specify int for the new X value, so where would I input the column names? Can you provide a sample of what you mean? Thanks for your time. – Tyrone_Slothrop Mar 15 '20 at 17:15
Create a DataFrame with one row [13, 67] and the same column names used for training, i.e. ['RegionCode', 'ProvinceCode'], then try using the predict function. Hope that works. – Vatsal Gupta Mar 16 '20 at 02:11
