I trained an XGBoost model in Python using the scikit-learn API and serialized it with the pickle library. I uploaded the model to ML Engine, but when I try to run online predictions, I get the following exception:

Prediction failed: Exception during xgboost prediction: can not initialize DMatrix from DMatrix

Here is an example of the JSON I'm using for prediction:

{  
   "instances":[  
      [  
         24.90625,
         21.6435643564356,
         20.3762376237624,
         24.3679245283019,
         30.2075471698113,
         28.0947368421053,
         16.7797359774725,
         14.9262079299572,
         17.9888028979966,
         15.3333284503293,
         19.6535308744024,
         17.1501961307627,
         0.0,
         0.0,
         0.0,
         0.0,
         0.0,
         509.0,
         497.0,
         439.0,
         427.0,
         407.0,
         1.0,
         1.0,
         1.0,
         1.0,
         1.0,
         2.0,
         23.0,
         10.0,
         58.0,
         11.0,
         20.0,
         23.3617021276596,
         23.3617021276596,
         23.3617021276596,
         23.3617021276596,
         23.3617021276596,
         23.9423076923077,
         26.3082269243683,
         23.6212606363851,
         22.6752334301282,
         27.4343583104833,
         34.0090408101173,
         11.1991944104063,
         7.33420726455092,
         8.15160392948917,
         11.4119236389594,
         17.9429092915607,
         18.0573102225845,
         32.8902876598084,
         -0.00286123032904149,
         -0.00286123032904149,
         -0.00286123032904149,
         -0.00286123032904149,
         -0.00286123032904149,
         -0.0028328611898017,
         0.0534138904223018,
         0.0534138904223018,
         0.0534138904223018,
         0.0534138904223018,
         0.0534138904223018,
         0.0531491870801522
      ]
   ]
}
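For anyone reproducing this, a minimal sketch of how a request body of that shape could be built with the standard library. The helper name `build_request` and the shortened four-value row are made up for illustration; only the `{"instances": [...]}` layout comes from the example above.

```python
import json


def build_request(rows):
    """Wrap feature rows in the {"instances": [...]} layout shown above."""
    # Coerce every value to float so ints and numpy scalars serialize cleanly.
    rows = [[float(v) for v in row] for row in rows]
    # Every instance should have the same number of features.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("rows have different lengths: %s" % sorted(widths))
    return json.dumps({"instances": rows})


# Hypothetical, shortened feature row (the real one has many more values).
body = build_request([[24.90625, 21.6435643564356, 0.0, 509.0]])
print(body)
```

The length check is only a convenience for catching ragged rows before the request is sent.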

I use the following code to train my model:

import xgboost as xgb

def _train_model(X, y):
    # Fit a gradient-boosted classifier through the scikit-learn wrapper.
    clf = xgb.XGBClassifier(max_depth=6,
                            learning_rate=0.01,
                            n_estimators=100,
                            n_jobs=-1)
    clf.fit(X, y)
    return clf

Here X and y are both numpy.ndarray:

Type of X: <class 'numpy.ndarray'>
Type of y: <class 'numpy.ndarray'>

I'm also using xgboost 0.72.1, Python 3.5 and ML Engine runtime version 1.9.

Does anyone know what the source of the problem might be?

Thanks!

Ismail
Lukas

3 Answers

The issue seems to be caused by the pickling. I was able to reproduce it and am working on a fix; in the meantime, could you try exporting your classifier as shown below instead?

clf._Booster.save_model('./model.bst') 

That should unblock you for now. If it doesn't, feel free to reach out to cloudml-feedback@google.com.

N3da
    Can you please provide an example of how you should load back such a model, @N3da? – kuza Jan 28 '19 at 13:49

I faced a similar problem, a feature mismatch, when I tried to score test data using a trained XGBoost model that had been dumped in .pkl format. After saving the model in .bst format instead, I was able to score the same data without any issues. There appears to be a difference between the .pkl and .bst serializations when it comes to XGBoost.

saurabhg07

Going a little further, and answering kuza's question above on loading the saved model:

Saving the model:

clf._Booster.save_model('./model.bst') 

Loading the saved model:

import xgboost

model = xgboost.Booster({'nthread': 4})  # initialize before loading model
model.load_model('./model.bst')  # load model

This cleared up two issues that I had with using pickle on the model. The first was an odd exception: ValueError: feature_names mismatch.

The second shows up if you call predict_proba on the loaded model and get an odd exception. The loaded object is a plain Booster rather than the scikit-learn wrapper, and it has no predict_proba method. The fix was simply to call predict instead of predict_proba.