I'm trying to train and run multi-class classifiers with Random Forest and Logistic Regression. On my machine (8 GB RAM, Core i5), this takes quite a long time even though the dataset has only about 34K records. Is there any way I can speed up the current run time by tweaking a few parameters?

I'm just giving an example for the Logistic Regression Randomized Search below.

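```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X.shape     # (34857, 18)
Y.shape     # (34857,)
Y.unique()  # array([7, 3, 8, 6, 1, 5, 9, 2, 4], dtype=int64)

params_logreg = {
    # C must be strictly positive, so 0 is dropped from the candidates
    'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    # liblinear is dropped: it does not support multi_class='multinomial'
    'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
    'penalty': ['l2'],
    'max_iter': [100, 200, 300, 400, 500],
    'multi_class': ['multinomial'],
}
folds = 2
n_iter = 2
scoring = 'accuracy'
n_jobs = 1

model_logregression = LogisticRegression()
# Search settings are passed by keyword; the data goes to fit(), not the constructor
model_logregression = RandomizedSearchCV(model_logregression, params_logreg,
                                         n_iter=n_iter, scoring=scoring,
                                         n_jobs=n_jobs, cv=folds, verbose=3)
model_logregression.fit(X, Y)
```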

```
[CV] solver=newton-cg, penalty=l2, multi_class=multinomial, max_iter=100, C=0.9
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  solver=newton-cg, penalty=l2, multi_class=multinomial, max_iter=100, C=0.9, score=0.5663798049340218, total= 2.7min
[CV] solver=newton-cg, penalty=l2, multi_class=multinomial, max_iter=100, C=0.9
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.7min remaining:    0.0s

[CV]  solver=newton-cg, penalty=l2, multi_class=multinomial, max_iter=100, C=0.9, score=0.5663625408848338, total= 4.2min
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=400, C=0.8
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  7.0min remaining:    0.0s

[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=400, C=0.8, score=0.5663798049340218, total=  33.9s
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=400, C=0.8
[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=400, C=0.8, score=0.5664773053308085, total=  26.6s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  8.0min finished
```


The Logistic Regression search takes about 8 minutes to run. In contrast, RandomForestClassifier takes only about 52 seconds.

Is there any way in which I can make this run faster by tweaking the parameters?

1 Answer

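Try normalizing (scaling) your features before fitting the logistic regression model. Features on a similar scale help the solver converge more quickly, and the `sag` and `saga` solvers in particular rely on this. Scikit-learn ships several scalers for this, so check the `sklearn.preprocessing` section of its documentation for details.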
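As a rough illustration of the scaling advice, here is a minimal sketch that fits the same `sag` model with and without a `StandardScaler` in front of it and reports how many iterations each fit used. The data is synthetic (`make_classification`, with columns stretched to very different ranges on purpose), so it only stands in for your `X` and `Y`; the sample size, the 1 to 1000 scale range, and `max_iter=200` are all assumptions chosen for the demo.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 18 features, 3 classes)
X, y = make_classification(n_samples=2000, n_features=18, n_informative=10,
                           n_classes=3, random_state=0)

# Stretch each column by a different factor so the features have very different scales
rng = np.random.RandomState(0)
X = X * rng.uniform(1, 1000, size=X.shape[1])

raw = LogisticRegression(solver='sag', max_iter=200, random_state=0)
scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='sag', max_iter=200, random_state=0),
)

with warnings.catch_warnings():
    # The unscaled fit may stop at max_iter; silence that warning for the demo
    warnings.simplefilter('ignore', ConvergenceWarning)
    raw.fit(X, y)
    scaled.fit(X, y)

# n_iter_ holds the number of iterations the solver actually ran
iters_raw = int(raw.n_iter_.max())
iters_scaled = int(scaled.named_steps['logisticregression'].n_iter_.max())

print('iterations without scaling:', iters_raw)
print('iterations with StandardScaler:', iters_scaled)
```

Keeping the scaler inside a pipeline also means that, if you pass the pipeline to a cross-validated search, the scaler is refit on each training fold rather than on the full dataset.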

Also, you are using RandomizedSearchCV around the logistic regression, which takes time because it fits a separate model for every sampled parameter combination on every fold and then compares them to find the best parameters.
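To put numbers on that, here is a small back-of-the-envelope sketch in plain Python. It counts only the valid settings from the question's grid (`C` must be positive and `liblinear` does not support `multinomial`), assumes the default `refit=True` so one extra fit happens at the end, and reuses the per-fit times printed in the question's log.

```python
from itertools import product

# Valid settings from the question's grid (C=0 and liblinear left out)
params_logreg = {
    'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
    'penalty': ['l2'],
    'max_iter': [100, 200, 300, 400, 500],
    'multi_class': ['multinomial'],
}
folds, n_iter = 2, 2

# Number of distinct parameter combinations in the grid
grid_size = len(list(product(*params_logreg.values())))

# Fits = candidates * folds, plus one final refit (assumes refit=True, the default)
random_search_fits = n_iter * folds + 1
grid_search_fits = grid_size * folds + 1

# Per-fit wall times copied from the log in the question, in seconds
fit_seconds = {'newton-cg': [2.7 * 60, 4.2 * 60], 'sag': [33.9, 26.6]}
total_seconds = sum(sum(times) for times in fit_seconds.values())
newton_cg_share = sum(fit_seconds['newton-cg']) / total_seconds

print('combinations in grid:', grid_size)
print('fits for this random search:', random_search_fits)
print('fits for an exhaustive grid search:', grid_search_fits)
print('share of logged time spent in newton-cg fits: %.0f%%' % (100 * newton_cg_share))
```

In the logged run the two `newton-cg` fits account for most of the 8 minutes, so on this data it may be worth trying a narrower solver list, or `n_jobs=-1` to spread the fits across all cores.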

secretive
  • Thanks. This seems like a logical explanation. Also, is sklearn's Normalizer the best technique, or could I use StandardScaler or MinMaxScaler as well? Would those two techniques also be feasible? – Harshwardhan Nandedkar Aug 02 '19 at 09:53
  • They are all usable options, depending on your data and needs. Experiment with them on your data to see which fits best. – secretive Aug 04 '19 at 00:42
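One way to run the experiment suggested in the last comment is sketched below: each scaler is placed in a pipeline with the same logistic regression and scored with cross-validation. The dataset is synthetic and only a placeholder for the real `X` and `Y`, and `cv=3` and `max_iter=1000` are arbitrary choices for the demo. Note that `StandardScaler` and `MinMaxScaler` rescale each feature (column), while `Normalizer` rescales each sample (row) to unit norm, so they are not interchangeable in what they do to the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Synthetic placeholder data (assumption: swap in your own X and Y)
X, y = make_classification(n_samples=1500, n_features=18, n_informative=10,
                           n_classes=3, random_state=0)

scalers = {
    'StandardScaler': StandardScaler(),  # per feature: zero mean, unit variance
    'MinMaxScaler': MinMaxScaler(),      # per feature: rescaled to [0, 1]
    'Normalizer': Normalizer(),          # per sample: row scaled to unit norm
}

scores = {}
for name, scaler in scalers.items():
    # The scaler is fit inside each CV fold because it is part of the pipeline
    pipe = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X, y, cv=3, scoring='accuracy').mean()

for name, score in scores.items():
    print('%-15s mean accuracy: %.3f' % (name, score))
```

Which scaler comes out ahead depends on the data, so the numbers from this synthetic set say nothing about the original problem; the loop is the part worth reusing.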