7

I trained a logistic model on the breast cancer data using ONLY one feature, 'mean_area', as follows:

from statsmodels.formula.api import logit
logistic_model = logit('target ~ mean_area',breast)
result = logistic_model.fit()

The fitted model has a built-in predict method. However, called without arguments it returns the predicted values for all the training samples, as follows:

predictions = result.predict()

Suppose I want the prediction for a new value, say 30. How do I use the trained model to output the value (rather than reading the coefficients and computing it manually)?

vishmay

4 Answers

6

You can provide new values to the .predict() method, as illustrated in output #11 in this notebook from the docs for a single observation. You can provide multiple observations as a 2D array, for instance a DataFrame - see docs.

Since you are using the formula API, your input needs to be a pd.DataFrame so that the column references are available. In your case, you could use something like .predict(pd.DataFrame({'mean_area': [1,2,3]})).

By default, when no alternative is provided, statsmodels' .predict() uses only the observations that were used for fitting.
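To make this concrete for the single value 30 from the question, here is a minimal sketch. The `breast` frame below is synthetic stand-in data (an assumption, since the original data is not shown); only the column names `target` and `mean_area` come from the question.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the breast cancer data (assumption: the real
# `breast` frame has a binary `target` and a numeric `mean_area` column).
rng = np.random.default_rng(0)
mean_area = rng.uniform(100, 2000, size=300)
p = 1 / (1 + np.exp(-(6.0 - 0.01 * mean_area)))
breast = pd.DataFrame({"mean_area": mean_area, "target": rng.binomial(1, p)})

result = logit("target ~ mean_area", breast).fit(disp=0)

# One new observation: a one-row DataFrame whose column name matches the formula.
new_obs = pd.DataFrame({"mean_area": [30]})
prob = result.predict(new_obs)  # predicted probability that target == 1
print(prob)
```

The returned value is a probability, not a class label, so you would still need to apply a threshold yourself if you want 0 or 1.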

Stefan
  • Thanks for the answer. I had a look at the notebook, however in my case when I try to give .predict(30) it throws an error " 'int' object has no attribute '__getitem__'' . – vishmay Aug 15 '16 at 20:08
  • 1
    You are getting this error because the `exog` parameter has to be `array-like`, so you'd have to use `[30]`. Arrays have `getitem` method because they can contain multiple items in contrast to `int`. – Stefan Aug 15 '16 at 20:34
  • Thanks when I try .predict([30]) I get the following error. "TypeError: list indices must be integers, not str" – vishmay Aug 15 '16 at 20:53
  • Sorry, because of the formula API the input has to be a `DataFrame`, see updated answer. – Stefan Aug 15 '16 at 21:11
  • Note that you can simply pass a dictionary into any of statsmodel's API's that accept dataframes - there is no need to create a dataframe unnecessarily. The following example notebook shows this in the last step: https://www.statsmodels.org/dev/examples/notebooks/generated/predict.html – flutefreak7 Mar 13 '19 at 19:38
  • The links listed are now dead. – Max Power Aug 19 '21 at 08:43
1
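import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data so the snippet runs on its own; substitute your own df
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]})

model = smf.ols('y ~ x', data=df).fit()

# Predict for a list of observations; the list can hold one value or many
prediction = model.get_prediction(exog=dict(x=[5, 10, 25]))
# Point predictions plus 95% confidence and prediction intervals
prediction.summary_frame(alpha=0.05)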
help-ukraine-now
silly
0

I had difficulty predicting values using a fresh pandas DataFrame, so I appended the rows to be predicted to the original dataset after fitting:

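import pandas as pd
import statsmodels.api as sm

# `data` is assumed to be an existing DataFrame with these columns
data.columns
# Index(['price', 'size', 'year'], dtype='object')
y = data['price']
x1 = data[['size', 'year']]
x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
results.summary()

# Predict on unknown data: append rows with a missing price
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
new_rows = pd.DataFrame({'size': [853.0, 777], 'year': [2012.0, 2013], 'price': [None, None]})
data = pd.concat([data, new_rows], ignore_index=True)
data.tail()
new_x = data.loc[data.price.isnull(), ['size', 'year']]
# has_constant='add' keeps the intercept column even if only one row is predicted
results.predict(sm.add_constant(new_x, has_constant='add'))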
Karan Bhandari
0

This is already answered but I hope this will help.

According to the documentation, the first parameter is "exog".

exog : array_like, optional. The values for which you want to predict.

Further it says,

"If a formula was used, then exog is processed in the same way as the original data. This transformation needs to have key access to the same variable names, and can be a pandas DataFrame or a dict like object that contains numpy arrays.

If no formula was used, then the provided exog needs to have the same number of columns as the original exog in the model. No transformation of the data is performed except converting it to a numpy array.

Row indices as in pandas data frames are supported, and added to the returned prediction"

from statsmodels.formula.api import logit

logistic_model = logit('target ~ mean_area',breast)
result = logistic_model.fit()

Therefore, you can pass a pandas DataFrame (e.g. df) as the exog parameter. The DataFrame must contain mean_area as a column, because 'mean_area' is the predictor (the independent variable). Note that predict is called on the fitted result, not on the unfitted model:

predictions = result.predict(exog=df)
cresclux