
I've built a logistic regression classifier on a few sets of comment data from a forum, but the model is taking ages to converge (14-16 hours). I need the output from statsmodels to show the goodness of fit for the model, so using sklearn is unfortunately not an option. This post had similar challenges to mine but no solution. I'm not sure if this would be a better question for the stats Stack Exchange, but the questions most similar to mine were posted here!

Model details: There are about 100,000 comments/samples and 5,000 features. The features are the top n words in the dataset, normalized with TF-IDF weighting (if you're familiar). Because of this, the feature set is very sparse, but I'm using a dense matrix to represent it.

I tested a model that was 1,000 samples x 1,000 features, and that fits and regularizes in 1-2 seconds, so taking over 16 hours to regularize seems extreme. In sklearn, the full feature and sample set finishes fitting/regularizing in about 10 seconds.

Here's my code that I'm using:

import numpy as np
import statsmodels.api as sm

# X_train, X_test, y_train, y_test are from sklearn's train_test_split function.
# X_train is the features for the training data, y_train the classes, etc.
# These are all dense matrices.

# add a constant column to the feature set
# (add_constant returns a new array rather than modifying in place)
X_train = sm.add_constant(X_train, prepend=True)
X_test = sm.add_constant(X_test, prepend=True)

# assign alpha
a_base = 1
alpha = a_base * np.ones(X_train.shape[1], dtype=np.float64)
alpha[0] = 0  # don't penalize the first column, which is our intercept

# fit and regularize
logit = sm.Logit(y_train, X_train)
results = logit.fit_regularized(method="l1_cvxopt_cp", alpha=alpha, disp=True)

Happy to provide more code and details if it's useful!

Stevie

1 Answer


Implementing Goodness of Fit

You could use the Pearson goodness of fit statistic or the Deviance statistic to accomplish this rather easily.

Pearson goodness of fit statistic

$$X^2=\sum_{i}\frac{(O_i - E_i)^2}{E_i}$$

Deviance statistic

$$G^2=2\sum_{i}O_i\log(\frac{O_i}{E_i})$$

Your observed data in this case would be some set of labelled data, your expected would be what your model predicts. The implementation I will leave up to you but it certainly wouldn't end up taking 14-16 hours for even 100,000 samples (if you had that many labelled). Once you have the statistic value you can run a chi squared test given your test statistic, degrees of freedom, and your desired confidence level. Or you can omit the confidence level and just use the resulting p-value to tell you how well your model "fits".
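The two sums above can be sketched directly in NumPy for a binary classifier. The names here are illustrative (they are not from the question's code): `y_true` holds the observed 0/1 labels and `p` the model's predicted probability of class 1 for each sample, so each observation contributes a 0-cell and a 1-cell of observed vs. expected counts:

```python
import numpy as np

# illustrative data standing in for held-out labels and model probabilities
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=1000)   # predicted P(class = 1)
y_true = rng.binomial(1, p)              # observed 0/1 labels

# per-observation cell counts: column 0 is the "class 0" cell, column 1 the "class 1" cell
O = np.column_stack([1 - y_true, y_true])  # observed one-hot counts
E = np.column_stack([1 - p, p])            # expected counts (probabilities)

# Pearson goodness of fit: X^2 = sum (O - E)^2 / E
X2 = ((O - E) ** 2 / E).sum()

# Deviance: G^2 = 2 * sum O * log(O / E), treating 0 * log(0) as 0
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(O > 0, O * np.log(O / E), 0.0)
G2 = 2 * terms.sum()
```

With the statistic in hand, the p-value is a chi-squared tail probability (e.g. `scipy.stats.chi2.sf(G2, df)`), with degrees of freedom equal to the number of observations minus the number of fitted parameters. Note that with 0/1 observations this `G2` reduces to the model's residual deviance, -2 times the log-likelihood of the observed classes.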

Why you shouldn't use Goodness of Fit

Now, all of that being said, I don't really like the idea of a single "goodness of fit" number for a model. Each problem is different and should really utilize whichever of a number of metrics best reflects how the model actually performs at its specific prediction task. For that reason sklearn provides a plethora of methods for doing just that. Take some time to look over the sklearn.metrics module, as well as the documentation section on Model Evaluation. That is just what sklearn has implemented, and it is only a tiny sample of the literature on this subject. This is a dense topic that can lead you down a very deep and dark rabbit hole, as well as spark some heated debate among team members or fellow researchers.

What is important is to understand what your model is actually trying to capture and predict. What do the ranges of success and failure actually look like? Once you understand that completely you can then identify a measure of "accuracy" that captures that in an intelligent way.
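As a concrete illustration, two common task-specific measures can be computed by hand in a few lines of NumPy. The arrays below are made up for the example; the quantities correspond to what `sklearn.metrics.accuracy_score` and `sklearn.metrics.log_loss` report:

```python
import numpy as np

# made-up held-out labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 1, 0, 1])
p = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.8])

# accuracy: fraction of samples whose most likely class matches the label
y_pred = (p >= 0.5).astype(int)
accuracy = (y_pred == y_true).mean()

# log loss (cross-entropy): heavily penalizes confident wrong predictions
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

The two can disagree: a model can have high accuracy but poor log loss if its probabilities are badly calibrated, which is exactly why the choice of metric should follow from what the model is being used for.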

Grr
  • Thank you for your response! In addition to getting the accuracy/precision recall/ROC curve, we also need the deviance and goodness of fit. That's why I turned to statsmodels over sklearn. I've been working on how to get these (or derive them) using sklearn for days because of how FAST the algorithm trains. Can you provide any insight into this? – Stevie Apr 12 '17 at 18:10
  • The two formulas I included will both give you a test statistic related to "goodness of fit". I apologize for the LaTeX, but I figure the more people use it the more likely Stack Overflow is to support it on this site. One of these is in fact deviance; you can use that for your goodness of fit chi squared test if you like. As far as implementing it, that is just a matter of getting the counts of observed predictions vs expected and doing a little math. – Grr Apr 12 '17 at 18:28
  • No worries about the LaTeX, it's silly that SO doesn't convert it - I threw the formulas into a LaTeX doc so I can see them. sklearn's logistic regression has a function `predict_proba()` that provides the probability estimates for each sample and each class (so [0.8,0.15],[0.98,0.2], etc). I also see that `predict_log_proba()` provides the log probabilities. Are these the kinds of data I need to make these calculations? I appreciate your help as I work through this - I'm still learning these statistics myself. – Stevie Apr 12 '17 at 18:49
  • Yeah, you could use that. Those probabilities represent the classes, so you could extract the most likely predicted class for each item from them, or use the `predict` method, which I believe will output just the most likely class. – Grr Apr 12 '17 at 21:23