
I've seen plenty of questions about this topic but couldn't find any clear answer to solve my problem: I save a model with the following code:

import pickle
from sklearn.svm import SVC

clf = SVC(gamma=1, C=1)
clf.fit(X_train, y_train)

# save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(clf, f)

I then load it with a different file:

import pickle

# load the model from disk
fname = 'finalized_model.sav'
with open(fname, 'rb') as f:
    clf = pickle.load(f)
y_pred = clf.predict(df_live)

I get this error:

ValueError: X.shape[1] = 22 should be equal to 26, the number of features at training time

when I prepare the data, I use:

df_dummies = pd.get_dummies(df)

The reason I get more features is that the training data is much larger than the prediction data, so it contains more categorical levels and therefore produces more dummy columns.
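One common way to handle that mismatch, sketched below with hypothetical stand-in frames (`df_train` and `df_live` are illustrative): save the dummy columns produced at training time, then `reindex` the live data's dummies against them, so missing columns are added as zeros and extras are dropped.

```python
import pandas as pd

# Hypothetical stand-ins for the training and live data
df_train = pd.DataFrame({"color": ["red", "blue", "green"]})
df_live = pd.DataFrame({"color": ["red"]})

train_dummies = pd.get_dummies(df_train)
train_columns = train_dummies.columns  # save these at training time

# At prediction time, add any missing dummy columns (filled with 0)
# and drop extras, so the live layout matches the training layout
live_dummies = pd.get_dummies(df_live).reindex(columns=train_columns,
                                               fill_value=0)
```

After the `reindex`, `live_dummies` has exactly the columns the model was trained on, in the same order.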

My question is what is the best practice to make the number of features even without harming the model?

Thanks

asher
  • You need to preserve the features you used when training the data. `get_dummies` is not good here. Use OneHotEncoder instead and save the training instance. And load it to transform the features just as you saved and loaded the final model. Search stackoverflow for similar questions. – Vivek Kumar Sep 04 '18 at 07:05
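A minimal sketch of that suggestion, using hypothetical column names: fit an `OneHotEncoder` with `handle_unknown='ignore'` (so unseen categories at prediction time become all-zero rows instead of errors), and pickle the encoder together with the model so both are restored at prediction time.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical training frame; replace with your own data
df_train = pd.DataFrame({"color": ["red", "blue", "green"],
                         "label": [0, 1, 0]})
X_train, y_train = df_train[["color"]], df_train["label"]

# handle_unknown='ignore' lets the encoder cope with unseen categories
enc = OneHotEncoder(handle_unknown="ignore")
X_enc = enc.fit_transform(X_train)  # sparse matrix; SVC accepts it

clf = SVC(gamma=1, C=1)
clf.fit(X_enc, y_train)

# Persist the encoder and the model together
with open("finalized_model.sav", "wb") as f:
    pickle.dump((enc, clf), f)

# Later, in the prediction script: load both, and transform the live
# data with the *saved* encoder so the feature count always matches
with open("finalized_model.sav", "rb") as f:
    enc, clf = pickle.load(f)
df_live = pd.DataFrame({"color": ["red"]})
y_pred = clf.predict(enc.transform(df_live[["color"]]))
```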

1 Answer


In general, you can perform data imputation to take care of missing data, but with entire features missing, unless you can provide meaningful values for the four missing ones, chances are you're better off simply removing them from X_train before fitting.
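The "remove them before fitting" option might look like the sketch below, with illustrative dummy-encoded frames (yours would come from `pd.get_dummies`): keep only the columns the live data actually has, then refit on that reduced training set.

```python
import pandas as pd

# Illustrative dummy-encoded frames
X_train = pd.DataFrame({"a": [1, 0], "b": [0, 1], "c": [1, 1]})
X_live = pd.DataFrame({"a": [1], "b": [0]})

# Keep only the features the live data actually has, then refit on that
shared = [c for c in X_train.columns if c in X_live.columns]
X_train_reduced = X_train[shared]
print(list(X_train_reduced.columns))  # ['a', 'b']
```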

fuglede
  • Thanks. But this model needs to run on new data almost daily. This means the features might differ each time. Do you mean I should perform the fitting every time I run the prediction model, and remove the irrelevant features from X_train before fitting according to the prediction features? – asher Sep 01 '18 at 11:54
  • Depends. If you want to stick to a single model (which you may want for practical reasons if estimation is slow, or if you've validated the model and determined it useful), then your setup sounds like imputation could be possible. If you have enough data, you could always put your proposed solution to the test and see how it performs. – fuglede Sep 01 '18 at 17:17