
What are some practical ways to speed up the following logistic regression analysis in Python? Here is my current setup:

Hardware

i5-3350P Quad Core
16GB DDR3
256GB Samsung 840EVO SSD
Quadro FX3000 GPU

Software

Win7x64
Anaconda 5.0.0 with Python3.6
Script run in a Jupyter Notebook

Here is the randomly generated dataset:

%%time
# Generating some random dataset (1min 7s)

import numpy as np
import pandas as pd

cols = 13
rows = 10000000
raw_data = np.random.randint(2, size=cols*rows).reshape(rows, cols)
col_names = ['v01','v02','v03','v04','v05','v06','v07', 
             'v08','v09','v10','v11','v12','outcome']
df = pd.DataFrame(raw_data, columns=col_names)
# Overwrite v11/v12 with random categorical labels
# (the lambdas ignore x, so this is one Python-level call per row)
df['v11'] = df['v03'].apply(lambda x: ['t1','t2','t3','t4'][np.random.randint(4)])
df['v12'] = df['v03'].apply(lambda x: ['p1','p2'][np.random.randint(2)])
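
As an aside, the two `apply` calls dominate the data-generation time. A hedged, vectorized sketch using `np.random.choice`, which draws each categorical column in a single numpy call instead of one lambda call per row:

%%time
# Vectorized alternative: one numpy call per categorical column instead
# of one Python-level lambda call per row
df['v11'] = np.random.choice(['t1', 't2', 't3', 't4'], size=rows)
df['v12'] = np.random.choice(['p1', 'p2'], size=rows)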

Here is the code to run the logistic regression analysis:

%%time
# run logistic regression (3min 2s)

import statsmodels.formula.api as smf

logit_formula = 'outcome ~ v01 + v02 + v03 + v04 + v05 + v06 + v07 + v08 + v09 + v10 + C(v11) + C(v12)'
logit_model = smf.logit(formula=logit_formula, data=df).fit()
print(logit_model.summary())

What else can I do to speed up the logit analysis further, hopefully down to just a few seconds?

  • Should I use a different Python library/framework instead of statsmodels?
  • Should I invest in better hardware? CPU? GPU? RAM?
  • Should I use a different Python distribution or a different IDE?
  • Should I feed the data from some kind of database server? Clustering? Parallel computing...?

Please advise, thank you very much.

Scoodood
  • I'm not familiar with the implementation details of `statsmodels` logistic regression, but you should consider looking at [`sklearn`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html); they provide several different solvers, some of which can be parallelized. Certainly worth comparing the performance. – juanpa.arrivillaga Oct 13 '17 at 23:19
  • 1
    You need to check where your script is spending the time. On my computer I get 1min 20sec for creating the model with the formula, and 12 to 15 seconds for the fit. Creating the model from predefined numpy arrays adds a negligible amount of time, with 12 to 15 seconds including the fit. The fit varies a bit by optimization method, often 'bfgs' and 'lbfgs' are the fastest. – Josef Oct 13 '17 at 23:42
  • Wow, 12-15 seconds, that's great! Which fit method did you pick? Something is slowing my PC down... may I know what your PC specs are? – Scoodood Oct 14 '17 at 04:02
  • i7-4790 Intel processor running Windows 8.1 with WinPython, which uses numpy and scipy with MKL linalg libraries. 16GB RAM. `logit_model2 = Logit(logit_model.endog, logit_model.exog).fit()` where logit_model is the model from the example built with the formula. The default `fit()`, which uses newton, takes 12s; lbfgs takes around 16s in this example. – Josef Oct 14 '17 at 12:57
  • BTW: I'm running statsmodels master, but I don't remember any speedups that would affect Logit since 0.8. – Josef Oct 14 '17 at 13:49
  • During the fitting process, my i5-3350P CPU is only at 25% utilization, so maybe it's not CPU-bound. I am wondering if it's due to your MKL library... – Scoodood Oct 15 '17 at 15:49
  • What is your time just for the `fit`? (see my first comment) In my initial run including the formula, CPU usage went briefly up to 50% with 8 virtual cores. That must be MKL multiprocessing, but for most of the time the CPU usage was full load on one processor. I think `fit` does not use enough linear algebra in this case for MKL parallel computation to make much difference (in `fit`). – Josef Oct 15 '17 at 19:38
  • My fitting process alone takes about 3 min on a Win7x64 VirtualBox VM. The host is Win7x64 with an i5-3350P quad-core CPU and 16GB DDR3; the VM running the fitting process is allocated 4 CPU cores and 10GB of RAM. During the fitting process, CPU utilization is about 25% and memory is about 70-80% (close to max). – Scoodood Oct 16 '17 at 21:24
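
Following up on Josef's comments, here is a minimal sketch that separates the formula/design-matrix construction from the fit, then refits directly from the prebuilt numpy arrays with different optimizers. The variable names come from the question; the timings quoted in the comments are machine-dependent, not guarantees.

%%time
# Split the timing: building the model from the formula (patsy parsing,
# design-matrix construction) is the expensive step; the fit is cheaper
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.logit(formula=logit_formula, data=df)  # formula parsing happens here

# Refit from the already-built arrays; model construction is then negligible
model2 = sm.Logit(model.endog, model.exog)
res_newton = model2.fit()               # default optimizer: Newton
res_lbfgs = model2.fit(method='lbfgs')  # often among the fastest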

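And a sketch along the lines of juanpa.arrivillaga's sklearn suggestion. Note the caveats: sklearn's `LogisticRegression` applies L2 regularization by default (a large `C` roughly approximates the unpenalized statsmodels fit), the categorical terms have to be one-hot encoded by hand, and there is no `summary()` with standard errors or p-values.

# sklearn route (assumption: plain coefficient estimates are enough)
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One-hot encode the categoricals, mirroring C(v11) and C(v12)
X = pd.get_dummies(df.drop('outcome', axis=1), columns=['v11', 'v12'],
                   drop_first=True)
y = df['outcome']

# Large C ~ effectively no regularization; 'sag' is one of the faster
# solvers on large datasets. n_jobs mainly helps for multiclass
# one-vs-rest, so its benefit for a binary outcome may be limited.
clf = LogisticRegression(C=1e9, solver='sag', n_jobs=-1)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)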