
I need to apply a function to every element of a large array, and I want to optimize the code with multiprocessing so that I can use all the cores on a supercomputer.

This is a follow-up to the question I asked here.

I used the code:

import numpy as np
from scipy import ndimage
import itertools
from pathos.multiprocessing import ProcessPool
import time

start = time.time()

#define the original array a as
a= np.load('100by100by100array.npy')

n= a.ndim #number of dimensions
imx, imy, imz = np.shape(a)
print('imx,imy,imz = '+str(imx)+','+str(imy)+','+str(imz))

#define the distorted array b as
imsd = 1 #how blurry the distorted image is
b = ndimage.filters.gaussian_filter(a, imsd, order=0, output=None, mode='reflect', cval=0.0)#, truncate=4.0)

#define a window/weighting function, i.e. a Gaussian for which the local values are computed

wsd = float(1.5) #sd for the window function

#create a n dimensional distribution
x, y, z = np.ogrid[0:imx, 0:imy, 0:imz]
r = (x**2 + y**2 + z**2)**0.5

G1= (2*np.pi)**(-(np.float(n)/2))*(wsd**(np.float(n)))*np.exp(-0.5*wsd**(-2)*(r)**2)

def SSIMfunc(triple):
    cenx, ceny, cenz = triple
    G= np.roll(G1, shift=cenx, axis=0)
    G= np.roll(G, shift=ceny, axis=1)
    if cenz is not None:
        G= np.roll(G, shift=cenz, axis=2)

    #define the mean
    mu_a = np.sum(G*a)
    mu_b = np.sum(G*b)

    #define the sample standard deviation

    sd_a = (np.sum(G*(a - mu_a)**2))**(0.5)
    sd_b = (np.sum(G*(b - mu_b)**2))**(0.5)
    sd_ab= np.sum(G*(a - mu_a)*(b - mu_b))

    C1=0.001
    l= (2*mu_a*mu_b+C1)/(mu_a**2 + mu_b**2 +C1)

    C2=0.001
    c= (2*sd_a*sd_b+C2)/(sd_a**2 + sd_b**2 +C2)

    C3=0.001
    s= (sd_ab +C3)/(sd_a*sd_b +C3)

    return l*c*s

if __name__ == '__main__':
    pool = ProcessPool()

    #map the function over every (x,y,z) index of the array
    results = pool.map(SSIMfunc, list(itertools.product(range(imx), range(imy), range(imz))))
    SSIMarray = np.reshape(results, (imx, imy, imz))

    #mean SSIM over all M elements
    M = imx*imy*imz
    MSSIM = np.sum(SSIMarray)/M
    print('MSSIM = '+str(MSSIM))

    end = time.time()
    print(end - start)

If I import and use a 40×40×40 array, it takes only 8 seconds of wall time and under 3 minutes of CPU time.

However, with a 100×100×100 array the code does not finish: it has used over 5 hours of CPU time without completing, and it does not complete within 10 minutes of wall time, or even within 1 hour of wall time.

I am guessing that something is wrong with passing such a large array through map, but I haven't been able to find a solution yet. Profiling with cProfile, most of the time appears to be spent in

method 'acquire' of 'thread.lock' objects

What is the correct way to apply a function to a large array of elements?

Note that I can't simply chop the array into pieces, because the function value for each element depends on the values around that element as well.
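
One idea I am considering (untested) is to keep the full arrays in each worker and send far fewer, larger tasks through the pool: on a fork-based system the workers inherit a, b and G1 from the parent process, so only the x index needs to cross the process boundary and each task can compute a whole y-z plane. A rough sketch, reusing SSIMfunc and the definitions above:

def SSIM_plane(cenx):
    #compute the full (imy, imz) plane of SSIM values for one x index;
    #a, b and G1 are read from the module-level globals inherited by each worker
    plane = np.empty((imy, imz))
    for ceny in range(imy):
        for cenz in range(imz):
            plane[ceny, cenz] = SSIMfunc((cenx, ceny, cenz))
    return plane

if __name__ == '__main__':
    pool = ProcessPool()
    #only imx tasks go through the pool instead of imx*imy*imz index triples
    planes = pool.map(SSIM_plane, range(imx))
    SSIMarray = np.stack(planes, axis=0)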

  • Map() uses queues to transmit data to workers. If there is a lot of it, this will become a bottleneck and you need to redesign your code. The faster but clearly more complex way of doing this is to use multiprocessing.Process and manage the "pool" yourself. When you create a subprocess this way, data "transfer" happens on C level when your process forks and creates a one-to-one copy of the process memory. There are many reasons why this might not be suitable, but there is little you can do to speed up queues used by Pool. – Hannu Jan 18 '18 at 08:56
  • Hi @Hannu , thanks for the comment - could you please go into more detail or link to anything that could provide an example of how to use multiprocessing.Process to create/manage a "pool"? – mallowcodes Jan 21 '18 at 11:08
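
For reference, a rough sketch of what the comment above seems to describe: managing the processes directly with multiprocessing.Process, so that each child inherits a, b, G1 and SSIMfunc when the parent forks, and only the computed planes travel back through a Queue. This is untested and assumes a fork-based start method (e.g. Linux) plus the definitions from the question:

import multiprocessing as mp

def worker(x_indices, out_q):
    #each child inherits the arrays via fork; only results are sent back
    for cenx in x_indices:
        plane = np.array([[SSIMfunc((cenx, ceny, cenz)) for cenz in range(imz)]
                          for ceny in range(imy)])
        out_q.put((cenx, plane))

if __name__ == '__main__':
    out_q = mp.Queue()
    #split the x indices roughly evenly across the available cores
    chunks = np.array_split(np.arange(imx), mp.cpu_count())
    procs = [mp.Process(target=worker, args=(chunk, out_q)) for chunk in chunks]
    for p in procs:
        p.start()
    SSIMarray = np.zeros((imx, imy, imz))
    for _ in range(imx):
        cenx, plane = out_q.get()   #drain the queue before joining to avoid blocking
        SSIMarray[cenx] = plane
    for p in procs:
        p.join()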
