
I am trying to diagonalize a large number of totally independent matrices using numpy.linalg with OpenMPI/mpi4py on a 6-core Intel Xeon machine.

When running with N processes, each matrix computation seems to take N times longer, so the total time for the computation is the same as (actually a bit slower than) the non-parallel version.

E.g. here's a simple script that just diagonalizes 12 random 1000x1000 matrices:

import time

import numpy as np
import numpy.linalg as la

# MPI imports ----
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N_nodes = comm.Get_size()

t0 = time.time()

def dosomework(N):
    # Build a random symmetric N x N matrix and diagonalize it
    matrix = np.random.rand(N, N)
    matrix = matrix + matrix.T
    la.eig(matrix)
    return 1

N_tot = 12
# Each of the N_nodes processes handles its share of the N_tot matrices
N_per_process = int(np.ceil(N_tot / N_nodes))
for j in range(N_per_process):
    dosomework(1000)
    print('Node:', rank, '; iteration:', j, '; time:', time.time() - t0)

if rank == 0:
    print('done, time =', time.time() - t0)

This takes about 6 seconds with one process, 6 seconds with two processes, and 9 seconds with four processes. Can anyone tell me what is going on? Why is this embarrassingly parallel code, with no MPI communication at all, not getting a speedup from running in parallel?

If I run the same code but replace the matrix diagonalization with a computation that doesn't go through scipy/numpy's linear-algebra routines, the parallel speedup is as expected.
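
For instance, a stand-in workload avoiding the BLAS/LAPACK-backed routines (a hypothetical replacement for la.eig, not the exact code from my test) could be:

def dosomework_nolinalg(N):
    # Elementwise arithmetic only, so no BLAS/LAPACK threading is involved
    matrix = np.random.rand(N, N)
    matrix = matrix + matrix.T
    (matrix * matrix).sum()
    return 1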

DrewP
  • What processor / memory do you run this on? It reads like you mean `done, time = 6 seconds`, but my observation is rather that each iteration takes 6 seconds (`iteration: j; time: 6`). Which time do you expect to go down? – Zulan Dec 04 '16 at 12:48

1 Answer


I resolved the issue: the linear algebra was using MKL, which by default lets a single process use all available threads. That left no cores free for the other processes, so the parallel code was effectively executing in serial, with the processes taking turns using the whole CPU.
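
To check which BLAS backend NumPy is linked against and how many threads it is allowed to use, the third-party threadpoolctl package is handy; a minimal sketch (assuming threadpoolctl is installed):

# pip install threadpoolctl
from threadpoolctl import threadpool_info

# One entry per loaded threadpool (MKL, OpenBLAS, OpenMP, ...),
# with the number of threads it is currently allowed to use
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"], pool["filepath"])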

To fix this, I can just add the commands (provided by the mkl-service package)

import mkl
mkl.set_num_threads(1)

which limit each diagonalization to one thread, and now parallelization speeds things up as expected.
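
Alternatively, the thread count can be capped with environment variables before NumPy is first imported, which avoids needing the mkl Python bindings (the variable names below are honored by MKL, OpenBLAS builds, and OpenMP-based backends respectively):

import os

# Must be set before numpy is imported for the first time
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS builds of NumPy
os.environ["OMP_NUM_THREADS"] = "1"       # generic OpenMP fallback

import numpy as np

Or directly on the command line, e.g. MKL_NUM_THREADS=1 mpirun -n 4 python script.py.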

DrewP