I have a model I'm trying to build using LogisticRegression in sklearn that has a couple thousand features and approximately 60,000 samples. I'm trying to fit the model, and it's been running for about 10 minutes now. The machine I'm running it on has tens of gigabytes of RAM and several cores at its disposal, and I was wondering if there is any way to speed the process up.
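For reference, these are the constructor parameters in the LogisticRegression docs that look relevant to speed. This is only a sketch of what I'd try, and from what I can tell n_jobs only parallelizes over classes in one-vs-rest, so it may do nothing for a binary fit:

    from sklearn.linear_model import LogisticRegression

    # Illustrative only: solver and n_jobs are real parameters, but I'm not
    # sure either actually speeds up a two-class problem.
    classifier = LogisticRegression(
        C=1.0,
        class_weight='balanced',
        solver='lbfgs',  # alternatives include 'liblinear' and 'saga'
        n_jobs=-1,       # parallelizes over classes only, not within one fit
    )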
EDIT: The machine has 24 cores, and here is the output of top to give an idea of memory:
Processes: 94 total, 8 running, 3 stuck, 83 sleeping, 583 threads 20:10:19
Load Avg: 1.49, 1.25, 1.19 CPU usage: 4.34% user, 0.68% sys, 94.96% idle
SharedLibs: 1552K resident, 0B data, 0B linkedit.
MemRegions: 51959 total, 53G resident, 46M private, 676M shared.
PhysMem: 3804M wired, 57G active, 1042M inactive, 62G used, 34G free.
VM: 350G vsize, 1092M framework vsize, 52556024(0) pageins, 85585722(0) pageouts
Networks: packets: 172806918/25G in, 27748484/7668M out.
Disks: 14763149/306G read, 26390627/1017G written.
I'm trying to train the model with the following:

    from sklearn.linear_model import LogisticRegression

    # 'balanced' replaces the class_weight='auto' option removed in newer sklearn
    classifier = LogisticRegression(C=1.0, class_weight='balanced')
    classifier.fit(train, response)
train has rows that are approximately 3,000 values long (all floating point), and each entry in response is either 0 or 1. I have approximately 50,000 observations.
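In case it helps, here is a self-contained version with random data standing in for my actual dataset (same shapes as described above, so timings should be roughly indicative):

    import time

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in: ~60,000 samples x 3,000 float features, binary labels.
    rng = np.random.default_rng(0)
    train = rng.standard_normal((60_000, 3_000))
    response = rng.integers(0, 2, size=60_000)

    classifier = LogisticRegression(C=1.0, class_weight='balanced')

    start = time.time()
    classifier.fit(train, response)
    print(f"fit took {time.time() - start:.1f} seconds")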