
What are some practical ways to speed up the following logistic regression analysis in Python? Here is my current setup:

Hardware

i5-3350P Quad Core
16GB DDR3
256GB Samsung 840EVO SSD
Quadro FX3000 GPU

Software

Win7x64
Anaconda 5.0.0 with Python3.6
Script run in a Jupyter Notebook

Here is the randomly generated dataset:

%%time
# Generating some random dataset (1min 7s)

import numpy as np
import pandas as pd

cols = 13
rows = 10000000
raw_data = np.random.randint(2, size=cols*rows).reshape(rows, cols)
col_names = ['v01','v02','v03','v04','v05','v06','v07', 
             'v08','v09','v10','v11','v12','outcome']
df = pd.DataFrame(raw_data, columns=col_names)
# Overwrite v11/v12 with random categorical labels
# (the lambdas ignore x, so this is one Python-level call per row)
df['v11'] = df['v03'].apply(lambda x: ['t1','t2','t3','t4'][np.random.randint(4)])
df['v12'] = df['v03'].apply(lambda x: ['p1','p2'][np.random.randint(2)])
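
As an aside, the two `apply` calls dominate the data-generation time. A hedged, vectorized sketch using `np.random.choice`, which draws each categorical column in a single numpy call instead of one lambda call per row:

%%time
# Vectorized alternative: one numpy call per categorical column instead
# of one Python-level lambda call per row
df['v11'] = np.random.choice(['t1', 't2', 't3', 't4'], size=rows)
df['v12'] = np.random.choice(['p1', 'p2'], size=rows)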

Here is the code to run the logistic regression analysis:

%%time
# run logistic regression (3min 2s)

import statsmodels.formula.api as smf

logit_formula = 'outcome ~ v01 + v02 + v03 + v04 + v05 + v06 + v07 + v08 + v09 + v10 + C(v11) + C(v12)'
logit_model = smf.logit(formula=logit_formula, data=df).fit()
print(logit_model.summary())

What else can I do to speed up the logit analysis further, hopefully down to just a few seconds?

  • Should I use a different Python library/framework instead of statsmodels?
  • Should I invest in better hardware? CPU? GPU? RAM?
  • Should I use a different Python distribution or a different IDE?
  • Should I feed the data from some kind of database server? Clustering? Parallel computing...?

Please advise, thank you very much.

Scoodood
  • I'm not familiar with the implementation details of `statsmodels` logistic regression, but you should consider looking at [`sklearn`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html); they provide several different solvers, some of which can be parallelized. Certainly worth comparing the performance. – juanpa.arrivillaga Oct 13 '17 at 23:19
  • 1
    You need to check where your script is spending the time. On my computer I get 1min 20sec for creating the model with the formula, and 12 to 15 seconds for the fit. Creating the model from predefined numpy arrays adds a negligible amount of time, with 12 to 15 seconds including the fit. The fit varies a bit by optimization method, often 'bfgs' and 'lbfgs' are the fastest. – Josef Oct 13 '17 at 23:42
  • Wow, 12-15 seconds, that's great! Which fit method did you pick? Something is slowing my PC down... may I know what your PC specs are? – Scoodood Oct 14 '17 at 04:02
  • i7-4790 Intel processor running Windows 8.1 with WinPython, which uses numpy and scipy with MKL linalg libraries. 16GB RAM. `logit_model2 = Logit(logit_model.endog, logit_model.exog).fit()` where logit_model is the model from the example built with the formula. The default `fit()`, which uses newton, takes 12s; lbfgs takes around 16s in this example. – Josef Oct 14 '17 at 12:57
  • BTW: I'm running statsmodels master, but I don't remember any speedups that would affect Logit since 0.8. – Josef Oct 14 '17 at 13:49
  • During the fitting process, my i5-3350P CPU is only at 25% utilization, so maybe it's not CPU-bound. I am wondering if it's due to your MKL library... – Scoodood Oct 15 '17 at 15:49
  • What is your time just for the `fit`? (see my first comment) In my initial run including the formula, CPU usage went briefly up to 50% with 8 virtual cores. That must be MKL multiprocessing, but for most of the time the CPU usage was full load on one processor. I think `fit` does not use enough linear algebra in this case for MKL parallel computation to make much difference (in `fit`). – Josef Oct 15 '17 at 19:38
  • My fitting process alone takes about 3 min on a Win7x64 VirtualBox VM. The host is Win7x64 with an i5-3350P quad-core CPU and 16GB DDR3; the VM running the fitting process is allocated 4 CPU cores and 10GB of RAM. During the fitting process, CPU utilization is about 25% and memory is about 70-80% (close to max). – Scoodood Oct 16 '17 at 21:24
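
Following up on Josef's comments, here is a minimal sketch that separates the formula/design-matrix construction from the fit, then refits directly from the prebuilt numpy arrays with different optimizers. The variable names come from the question; the timings quoted in the comments are machine-dependent, not guarantees.

%%time
# Split the timing: building the model from the formula (patsy parsing,
# design-matrix construction) is the expensive step; the fit is cheaper
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.logit(formula=logit_formula, data=df)  # formula parsing happens here

# Refit from the already-built arrays; model construction is then negligible
model2 = sm.Logit(model.endog, model.exog)
res_newton = model2.fit()               # default optimizer: Newton
res_lbfgs = model2.fit(method='lbfgs')  # often among the fastest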

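And a sketch along the lines of juanpa.arrivillaga's sklearn suggestion. Note the caveats: sklearn's `LogisticRegression` applies L2 regularization by default (a large `C` roughly approximates the unpenalized statsmodels fit), the categorical terms have to be one-hot encoded by hand, and there is no `summary()` with standard errors or p-values.

# sklearn route (assumption: plain coefficient estimates are enough)
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One-hot encode the categoricals, mirroring C(v11) and C(v12)
X = pd.get_dummies(df.drop('outcome', axis=1), columns=['v11', 'v12'],
                   drop_first=True)
y = df['outcome']

# Large C ~ effectively no regularization; 'sag' is one of the faster
# solvers on large datasets. n_jobs mainly helps for multiclass
# one-vs-rest, so its benefit for a binary outcome may be limited.
clf = LogisticRegression(C=1e9, solver='sag', n_jobs=-1)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)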