I have set up a simple experiment to check how much a multi-core CPU helps when running sklearn GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder whether I misunderstood the benefits of multiple cores or simply didn't set it up right.
There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for the last cell. I was expecting each run to finish faster. Maybe not linearly faster, i.e. 8 jobs being twice as fast as 4 jobs, but at least somewhat faster.
This is how I set it up (I am using a Jupyter notebook; "cell" below means a notebook cell). I loaded MNIST and used a test_size of 0.05 to get 3000 digits into X_play.
from sklearn.datasets import fetch_openml  # fetch_mldata was removed from sklearn; mldata.org is offline
from sklearn.model_selection import train_test_split
mnist = fetch_openml('mnist_784', version=1, as_frame=False)  # 70000 digits, 784 pixels each
X, y = mnist["data"], mnist["target"]
# standard MNIST split: first 60000 for training, last 10000 for testing
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# stratified 5% sample of the training set -> 3000 digits in X_play
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)
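A quick sanity check on the split sizes (this print is not in the original notebook, just an illustration):
print(X_play.shape, y_play.shape)  # expected: (3000, 784) (3000,)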
In the next cell I set up the KNN classifier and the parameter grid for GridSearchCV:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn_clf = KNeighborsClassifier()
# 2 weight options x 3 values of n_neighbors = 6 candidates
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
Then I ran 8 cells, one for each n_jobs value (in the snippet below, N_JOB_1_TO_8 stands for the value from 1 to 8). My CPU is an i7-4770 with 4 cores / 8 threads. With 6 candidates and cv=3, each search performs 6 x 3 = 18 fits, which matches the "Done 18 out of 18" lines in the logs.
grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)
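For reference, the 8 cells boil down to this timed loop (shown here as a compact equivalent, not the literal cells I ran):
import time
for n_jobs in range(1, 9):
    grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=n_jobs)
    start = time.time()
    grid_search.fit(X_play, y_play)
    print("n_jobs=%d: %.1f min" % (n_jobs, (time.time() - start) / 60.0))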
Results
[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 2.0min finished
[Parallel(n_jobs=2)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=3)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=4)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=5)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=6)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=7)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=8)]: Done 18 out of 18 | elapsed: 1.4min finished
Second test
CPU usage with a RandomForestClassifier scaled much better. This time test_size was 0.5, i.e. 30000 images.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
# 5 values of n_estimators x 5 of max_features x 2 criteria = 50 candidates; with cv=3 that is 150 fits
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]
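The search itself was run the same way as in the first test, one cell per n_jobs value (the call below repeats the earlier one; the 150 = 50 x 3 fits in the logs confirm cv=3):
grid_search = GridSearchCV(rf_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)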
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished
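For comparison, here are the speedups implied by those timings (computed directly from the log lines above):
elapsed = [110.9, 56.8, 39.3, 35.3, 36.0, 34.4, 32.1, 30.1]
for n_jobs, t in enumerate(elapsed, start=1):
    print("n_jobs=%d: %.2fx" % (n_jobs, elapsed[0] / t))
# tops out around 3.7x on a 4-core / 8-thread CPU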