I'm trying to learn the basics of XGBoost and devised a script that splits some data I found on Kaggle about coronavirus outbreaks in China. The code and model work, but for some reason when I use the model to make a new prediction I get a "ValueError: feature_names mismatch." The new input is a 2-D array with 2 values per row, just like the test data, but I still get the error.

train = df[['RegionCode','ProvinceCode']].astype(int)
test = df['infected'].astype(int)

X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)

train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

param = {
'max_depth':4,
'eta':0.3,
'num_class': 2}
epochs = 10

model = xgb.train(param, train, epochs)

All the code above works, but the test below gives me the error:

testArray=np.array([[13, 67]])

test_individual = xgb.DMatrix(testArray)

print(model.predict(test_individual))

Any idea what I'm doing wrong?

nz_21
Tyrone_Slothrop
  • You are not splitting the data properly, please go through my [answer](https://stackoverflow.com/questions/60636444/what-is-the-difference-between-x-test-x-train-y-test-y-train-in-sklearn/60637924?noredirect=1#comment107322867_60637924) on another post for clarity. – ManojK Mar 13 '20 at 17:28

1 Answer

Seems like you are missing out on the basics of using the train_test_split function of sklearn.

X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)

The line above expects train to hold all the features used for training and test to hold the target column.

Try fixing that first.
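For reference, train_test_split returns its pieces in the order X_train, X_test, y_train, y_test, so the names on the left-hand side should follow that order. Here is a minimal sketch on made-up arrays (X and y are placeholders, not the Kaggle data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows with 2 feature columns and one target value each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The return order is: train features, test features, train target, test target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With test_size=0.2, 2 of the 10 rows go to the test split.
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

With the names swapped as in the question, the variable called X_train would actually hold the 20% test portion.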

Vatsal Gupta
  • But isn't that what I'm doing? I have train = df[['RegionCode','ProvinceCode']].astype(int) and test = df['infected'].astype(int). train is my features and test is target. – Tyrone_Slothrop Mar 15 '20 at 15:41
  • Oh! I just had a look again. You need to provide the column names in the testArray that you are taking. That would solve it. – Vatsal Gupta Mar 15 '20 at 16:24
  • Thank you! I don't quite understand what you mean by providing the column name? I specify int for the new X value, so where would I input the column names? Can you provide a sample of what you mean? Thanks for your time. – Tyrone_Slothrop Mar 15 '20 at 17:15
  • Create a dataframe with one row [13, 67] and the same column names as provided in the train i.e. ['RegionCode','ProvinceCode'] , then try using the predict function. Hope that works – Vatsal Gupta Mar 16 '20 at 02:11