I am trying to run a large number of completely independent matrix diagonalizations in parallel using numpy.linalg and openmpi/mpi4py on a 6-core Intel Xeon machine.
When running with N processes, each individual diagonalization seems to take N times longer, so the total time for the computation is the same as (actually a bit slower than) the non-parallel version.
E.g. here's a simple script that just diagonalizes 12 random 1000x1000 matrices:
import numpy as np
import numpy.linalg as la
import os
import random
import time
# MPI imports ----
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N_nodes = comm.Get_size()
t0 = time.time()
def dosomework(N):
    # generate a random symmetric NxN matrix and diagonalize it
    matrix = np.random.rand(N,N)
    matrix = matrix+matrix.T
    la.eig(matrix)
    return 1

N_tot = 12
N_per_process = np.int(np.ceil(N_tot/N_nodes))
worker_data = []
for j in range(N_per_process):
    dosomework(1000)
    print 'Node:',rank,'; iteration:',j,'; time:',time.time()-t0
if rank==0:
    print 'done, time = ',time.time()-t0
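(For reference, the script is launched in the standard Open MPI way; the filename here is just a placeholder for the code above:)

mpiexec -n 4 python eig_test.py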
This takes about 6 seconds with one process, 6 seconds with two processes, and 9 seconds with four processes. Can anyone tell me what is going on? Why is this embarrassingly parallel code, with no MPI communication at all, not getting a speedup from running in parallel?
If I run the same code, but replace the matrix diagonalization with a non-scipy.linalg