
I want to use Python multiprocessing to run a grid search for a predictive model. When I look at core usage, it always seems to be using only one core. Any idea what I'm doing wrong?

import multiprocessing
from sklearn import svm
import itertools
from operator import itemgetter

#first read some data
#X will be my feature NumPy 2D array
#y will be my 1D NumPy array of labels
#SEED will be my random seed

#define the grid        
C = [0.1, 1]
gamma = [0.0]
params = [C, gamma]
grid = list(itertools.product(*params))
GRID_hx = []

def worker(par, grid_list):
    #define a sklearn model for this parameter combination
    clf = svm.SVC(C=par[0], gamma=par[1], probability=True, random_state=SEED)
    #run a cross validation function: returns error
    ll = my_cross_validation_function(X, y, model=clf, n=1, test_size=0.2)
    print(par, ll)
    grid_list.append((par, ll))


if __name__ == '__main__':
   manager = multiprocessing.Manager()
   GRID_hx = manager.list()
   jobs = []
   for g in grid:
      p = multiprocessing.Process(target=worker, args=(g,GRID_hx))
      jobs.append(p)
      p.start()
      p.join()

   print("\n-------------------")
   print("SORTED LIST")
   print("-------------------")
   L = sorted(GRID_hx, key=itemgetter(1))
   for l in L[:5]:
      print(l)
  • Once you fix that join, you may also want to read up on the Global Interpreter Lock (GIL). Python cannot execute Python code on two threads at the same time. However, C libraries for Python like numpy can *elect* to release the GIL while doing very computationally intensive work. If you want to use multiple cores effectively, make sure most of your work is done in one of those C libraries that drops the GIL while doing work. – Cort Ammon Apr 22 '15 at 18:08
  • Note: you probably want to use a [`Pool`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool) instead of manually creating and joining every single process. Just do `pool.starmap(worker, zip(grid, [GRID_hx]*len(grid)))` and this will automatically launch the different processes (in parallel) and join them. – Bakuriu Apr 22 '15 at 18:32
  • @CortAmmon What you are writing is completely irrelevant. He is using multi **processing**, not multi *threading*, so the GIL doesn't play *any* role in that code. Also: the fact that he's using `multiprocessing` instead of `threading` probably means he already knows about the GIL. – Bakuriu Apr 22 '15 at 18:33
  • @Bakuriu you are absolutely right. I'm sorry I missed that! – Cort Ammon Apr 22 '15 at 19:39

3 Answers


Your problem is that you join each job immediately after starting it:

for g in grid:
    p = multiprocessing.Process(target=worker, args=(g,GRID_hx))
    jobs.append(p)
    p.start()
    p.join()

join blocks until the respective process has finished. This means that your code starts one process at a time, waits for it to finish, and only then starts the next one.

In order for all processes to run in parallel, you need to first start them all and then join them all:

jobs = []
for g in grid:
    p = multiprocessing.Process(target=worker, args=(g,GRID_hx))
    jobs.append(p)
    p.start()

for j in jobs:
    j.join()

Documentation: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.join
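
As the comments suggest, a `Pool` is often the tidier option: it spawns the worker processes, distributes the work and joins everything for you. A minimal sketch of that approach (it reuses the question's placeholders X, y, SEED and my_cross_validation_function, and has the worker return its result instead of appending to a managed list):

import itertools
from multiprocessing import Pool
from operator import itemgetter

from sklearn import svm

def worker(par):
    #build the model for this parameter combination
    clf = svm.SVC(C=par[0], gamma=par[1], probability=True, random_state=SEED)
    #cross-validate and return (params, error) instead of mutating shared state
    ll = my_cross_validation_function(X, y, model=clf, n=1, test_size=0.2)
    return (par, ll)

if __name__ == '__main__':
    grid = list(itertools.product([0.1, 1], [0.0]))
    with Pool() as pool:                    #one process per CPU core by default
        results = pool.map(worker, grid)    #starts, distributes and joins for you
    for par, ll in sorted(results, key=itemgetter(1))[:5]:
        print(par, ll)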

helmbert

According to the documentation, join() blocks the calling thread until the process whose join() method was called terminates. So you are effectively starting each process in the for loop and then waiting for it to finish, BEFORE you proceed to the next iteration.

I would suggest moving the joins outside the loop!
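
A minimal sketch of that fix, reusing worker, grid and the managed GRID_hx list from the question:

jobs = []
for g in grid:
    p = multiprocessing.Process(target=worker, args=(g, GRID_hx))
    jobs.append(p)
    p.start()

#join only after every process has been started
for j in jobs:
    j.join()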

Robin Nabel

I'd say:

jobs = []
for g in grid:
    #grid holds tuples, which don't accept attribute assignment,
    #so keep track of the processes in a separate list
    p = multiprocessing.Process(target=worker, args=(g, GRID_hx))
    jobs.append(p)
    p.start()

for p in jobs:
    p.join()

Currently you're spawning a job, then waiting for it to be done, then going to the next one.

Calvin1602