What are some practical ways to speed up the following logistic regression analysis in Python? Here is my current setup:
Hardware
i5-3350P Quad Core
16GB DDR3
256GB Samsung 840EVO SSD
Quadro FX3000 GPU
Software
Win7x64
Anaconda 5.0.0 with Python 3.6
Script run in a Jupyter Notebook
Here is the randomly generated dataset
%%time
# Generating some random dataset (1min 7s)
import numpy as np
import pandas as pd
cols = 13
rows = 10000000
raw_data = np.random.randint(2, size=cols*rows).reshape(rows, cols)
col_names = ['v01','v02','v03','v04','v05','v06','v07',
'v08','v09','v10','v11','v12','outcome']
df = pd.DataFrame(raw_data, columns=col_names)
df['v11'] = df['v03'].apply(lambda x: ['t1','t2','t3','t4'][np.random.randint(4)])
df['v12'] = df['v03'].apply(lambda x: ['p1','p2'][np.random.randint(2)])
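As an aside, the two apply calls are the slow part of the generation step: the lambda runs once per row and ignores its argument anyway. If the intent is just to draw a uniform random label per row, a single vectorized draw per column should do the same job far faster (a sketch, assuming uniform labels are what you want):

# Vectorized alternative to the per-row lambdas above:
# one np.random.choice call per column instead of 10M Python-level calls
df['v11'] = np.random.choice(['t1', 't2', 't3', 't4'], size=rows)
df['v12'] = np.random.choice(['p1', 'p2'], size=rows)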
Here is the code that runs the logistic regression analysis:
%%time
# run logistic regression (3min 2s)
import statsmodels.formula.api as smf
logit_formula = 'outcome ~ v01 + v02 + v03 + v04 + v05 + v06 + v07 + v08 + v09 + v10 + C(v11) + C(v12)'
logit_model = smf.logit(formula=logit_formula, data=df).fit()
print(logit_model.summary())
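One statsmodels-side change worth testing before switching libraries is to bypass the formula interface: smf.logit hands the formula to patsy, which parses it and materializes a fresh design matrix from all 10M rows, whereas building the matrix yourself with pandas skips that layer. A sketch of the same model (drop_first=True is intended to mirror the reference-level treatment coding that C() applies, so verify the coefficients line up):

import statsmodels.api as sm

# Hand-build the design matrix: dummy-encode v11/v12, then add an intercept
X = pd.get_dummies(df.drop('outcome', axis=1),
                   columns=['v11', 'v12'], drop_first=True)
X = sm.add_constant(X).astype(float)
logit_model = sm.Logit(df['outcome'], X).fit()
print(logit_model.summary())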
What else can I do to speed up the logit analysis further, hopefully down to just a few seconds?
- should I use a different Python library/framework instead of statsmodels? (see the sketch just below this list)
- should I invest in better hardware? CPU? GPU? RAM?
- should I use a different Python distribution or a different IDE?
- should I feed the data from some kind of database server? Clustering? Parallel computing...?
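On the first bullet: scikit-learn's LogisticRegression is usually much faster at this scale, with two caveats: it applies L2 regularization by default (set C very large to approximate the unregularized MLE that statsmodels computes), and it gives you coefficients rather than a p-value summary table. A minimal sketch:

from sklearn.linear_model import LogisticRegression

# Same one-hot design matrix as before; 'sag' scales well for large n,
# and a huge C effectively disables the default L2 penalty
# (raise max_iter if the solver warns about non-convergence)
X = pd.get_dummies(df.drop('outcome', axis=1), columns=['v11', 'v12'], drop_first=True)
y = df['outcome'].values
clf = LogisticRegression(solver='sag', C=1e9)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)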
Please advise, thank you very much.