I've built a logistic regression classifier on a few sets of comment data from a forum, but the model is taking ages to converge (14-16 hours). I need the output from statsmodels to show the goodness of fit for the model, so using sklearn is unfortunately not an option. This post had similar challenges to mine but no solution. I'm not sure whether this would be a better question for the stats Stack Exchange, but the questions most similar to mine were posted here!
Model details: There are about 100,000 comments/samples and 5,000 features. The feature space is the top n words in the dataset, normalized (TF-IDF, if you're familiar). Because of this, the feature matrix is very sparse, but I'm representing it with a dense matrix.
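For context, the feature matrix is built roughly like this (a sketch using sklearn's TfidfVectorizer; the exact vectorizer settings and the placeholder comments are assumptions, not my real pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder comments standing in for the real forum data
comments = ["great post", "terrible post", "great forum"]

# Keep the top-n words by frequency; rows are L2-normalized by default
vectorizer = TfidfVectorizer(max_features=5000)
X_sparse = vectorizer.fit_transform(comments)  # scipy CSR sparse matrix

# statsmodels expects a dense array, so the sparse matrix is densified here
X_dense = X_sparse.toarray()
```

With 100,000 comments and 5,000 top words, that densified array is mostly zeros, which is the sparsity I mentioned.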
I tested a model that was 1,000 samples x 1,000 features, and that fits and regularizes in 1-2 seconds, so taking over 16 hours to regularize seems extreme. In sklearn, the full feature and sample set finishes fitting/regularizing in about 10 seconds.
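For reference, the sklearn fit that finishes in about 10 seconds looks something like this (the solver choice and random placeholder data are assumptions; liblinear is one of the solvers that supports an L1 penalty):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random placeholder data standing in for the TF-IDF matrix and labels
rng = np.random.default_rng(0)
X_train = rng.random((1000, 100))
y_train = rng.integers(0, 2, size=1000)

# L1-penalized logistic regression; C is the inverse of the penalty strength
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, y_train)
```

The catch, as noted above, is that sklearn doesn't give me the statsmodels-style goodness-of-fit summary.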
Here's my code that I'm using:
import numpy as np
import statsmodels.api as sm
#X_train, X_test, y_train, y_test are from sklearn's train_test_split function.
#X_train is features for training data, y_train are the classes, etc.
#These are all dense numpy arrays.
#Add a constant column to the feature sets.
#add_constant returns a new array, so the result has to be assigned back.
X_train = sm.add_constant(X_train, prepend=True)
X_test = sm.add_constant(X_test, prepend=True)
#Assign alpha
a_base = 1
alpha = a_base * np.ones(X_train.shape[1], dtype=np.float64)
alpha[0] = 0 #don't penalize the first column, which is our intercept
#Fit and regularize
logit = sm.Logit(y_train, X_train)
results = logit.fit_regularized(method="l1_cvxopt_cp", alpha=alpha, disp=True)
Happy to provide more code and details if it's useful!