
I have been able to fit the model using result = logit.fit().

Now, for the test and validation sets, shall I just do result.predict(test_df[features]) and result.predict(vald_df[features])? Is that all, or am I missing a step? And how would it be different when I try to deploy the model for daily prediction?

I am new to statsmodels; in fact, I started today and am rather short on time. I checked a few blogs, but the information is disjointed, so I just wanted to be sure.

Also, is there a way to extract the area under the ROC curve directly from statsmodels, rather than coding our way through it?

CARTman
  • For predict, that's it. Once you have the results instance from a fitted model, you can just call `predict` on new data. The new data needs to match the structure of the original data. If you used formulas, then the data will be transformed in the same way as the training data. If you directly provided a design matrix as a DataFrame or numpy array, then the data for prediction needs to match it, e.g. you need to include the constant explicitly; it is not added automatically. – Josef Dec 05 '16 at 18:40
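A minimal sketch of that, assuming hypothetical column names (`x1`, `x2`, `y`) and synthetic data rather than the asker's actual frames; the point is only that the array/DataFrame interface needs `sm.add_constant` on both the training and the prediction data, while the formula interface applies the same transformations (including the intercept) to new data automatically:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical stand-ins for the question's train/test frames.
rng = np.random.default_rng(0)
train_df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
train_df["y"] = (train_df["x1"] + rng.normal(size=100) > 0).astype(int)
test_df = pd.DataFrame({"x1": rng.normal(size=20), "x2": rng.normal(size=20)})
features = ["x1", "x2"]

# Array/DataFrame interface: add the constant explicitly, both times.
X_train = sm.add_constant(train_df[features])
result = sm.Logit(train_df["y"], X_train).fit(disp=0)
X_test = sm.add_constant(test_df[features], has_constant="add")
test_pred = result.predict(X_test)            # predicted probabilities

# Formula interface: patsy re-applies the training transformations itself.
result_f = smf.logit("y ~ x1 + x2", data=train_df).fit(disp=0)
test_pred_f = result_f.predict(test_df)       # also probabilities
```

For the daily-prediction part of the question, the pattern is the same call: persist the fitted results instance (statsmodels results objects have save/load, e.g. result.save("logit.pickle")) and run predict on each day's fresh data, which must have the same columns and transformations as the training data.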

1 Answer


For the first question: each ML algorithm (trees, logistic regression, ...) has parameters. To find the best parameters for an algorithm, we train multiple models with different parameter values and keep the model (parameters) that gives the best score on the validation set. That validation score does not tell you how the model will perform once in production (prediction); for that, you evaluate the model with the best parameters on the test set, and this final score gives you an idea of how your model will perform in production.
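A minimal sketch of that workflow with statsmodels' Logit, on a hypothetical synthetic dataset; here the tuned "parameter" is simply the feature subset and the validation score is plain accuracy, but the same select-on-validation, report-on-test pattern applies to any tunable choice and metric:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical synthetic stand-ins for train/validation/test frames.
rng = np.random.default_rng(1)
def make_df(n):
    df = pd.DataFrame({"x1": rng.normal(size=n),
                       "x2": rng.normal(size=n),
                       "x3": rng.normal(size=n)})
    df["y"] = (df["x1"] - 0.5 * df["x2"] + rng.normal(size=n) > 0).astype(int)
    return df

train_df, vald_df, test_df = make_df(300), make_df(100), make_df(100)

def accuracy(result, df, features):
    """Share of correct 0/1 calls at a 0.5 probability threshold."""
    X = sm.add_constant(df[features], has_constant="add")
    return ((result.predict(X) > 0.5).astype(int) == df["y"]).mean()

best = None
for features in (["x1"], ["x1", "x2"], ["x1", "x2", "x3"]):   # candidate "parameters"
    X_train = sm.add_constant(train_df[features])
    result = sm.Logit(train_df["y"], X_train).fit(disp=0)
    score = accuracy(result, vald_df, features)               # choose on validation
    if best is None or score > best[0]:
        best = (score, features, result)

val_score, best_features, best_result = best
test_score = accuracy(best_result, test_df, best_features)    # report on test
print(best_features, val_score, test_score)
```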

For the second question, you can use scikit-learn; I googled and found these examples: http://www.programcreek.com/python/example/82598/sklearn.metrics.auc
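A minimal sketch of that combination (statsmodels for the fit and p-values, scikit-learn for the metric), again on hypothetical synthetic data; the key point is that `result.predict` returns predicted probabilities, which is exactly what `roc_auc_score` expects:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, roc_curve, auc

# Hypothetical data; in practice this would be the fitted result and the
# test/validation frame from the question.
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["y"] = (df["x1"] + rng.normal(size=200) > 0).astype(int)

X = sm.add_constant(df[["x1"]])
result = sm.Logit(df["y"], X).fit(disp=0)

probs = result.predict(X)                      # predicted probabilities
print(roc_auc_score(df["y"], probs))           # area under the ROC curve

# Equivalent, going through the ROC curve itself:
fpr, tpr, _ = roc_curve(df["y"], probs)
print(auc(fpr, tpr))
```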

  • Both my queries are specific to developing a solution with the Python package 'statsmodels'. I know scikit-learn can give me the metrics, but I am using statsmodels because it is not easy to extract p-values for coefficients in scikit-learn, which I need and can get from statsmodels – CARTman Dec 05 '16 at 17:58
  • statsmodels doesn't have AUC, but I think you can call the scikit-learn AUC function with the results from statsmodels. – Josef Dec 05 '16 at 18:37
  • That did not occur to me; let me give it a try – CARTman Dec 05 '16 at 18:49
  • Just one clarification required: when we predict, we get an ndarray or a pandas Series as the return value. Am I wrong to assume that the order of the predictions is the same as that of the input array? I just want to be sure that the predictions can be merged with the input array to identify the rows they belong to. – CARTman Dec 05 '16 at 19:45
  • @CARTman In general, yes. The only problem that existed (or still exists) is if there are NaNs/NAs in the prediction data and the formula interface is used. In that case patsy removes the NaN/NA rows, and the returned prediction will be missing those rows. This has been changed recently when the data passed to predict is a pandas DataFrame or, more generally, has an index attribute (see the sketch after these comments). – Josef Dec 05 '16 at 19:56
  • Thanks again! I am using a pandas DataFrame. – CARTman Dec 06 '16 at 05:24
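A small sketch of that index alignment, on hypothetical data: with a DataFrame as input, the prediction comes back as a pandas Series carrying the input's index, so it can simply be assigned back onto the frame (pandas aligns by index, and per the comment above a row dropped for a NaN predictor ends up as a NaN prediction rather than shifting the others):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical training data.
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=50)})
df["y"] = (df["x1"] + rng.normal(size=50) > 0).astype(int)
result = smf.logit("y ~ x1", data=df).fit(disp=0)

# New data with a non-default index and one NaN predictor.
new_df = pd.DataFrame({"x1": [0.1, np.nan, -0.3]}, index=[10, 11, 12])
pred = result.predict(new_df)

# The Series keeps the index 10/11/12, so merging back identifies each row;
# the NaN row simply gets a NaN prediction instead of misaligning the rest.
new_df["pred"] = pred
print(new_df)
```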