
I would like to create a 2D array that can be accessed for read-only operations by multiple processes. My use case is to apply a function that takes pairs of columns from the 2D array (representing a matrix) and computes a scalar value (float64) depending on a tuple of additional arguments. A single call of the function f for an m-by-n array X with argument tuple (a1, a2) would look like f(X, (a1, a2)). In my use case I would like each process to compute f with different choices of (a1, a2).
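For concreteness, a hypothetical f might look something like the following, where np.corrcoef is just a stand-in for the real scalar computation:

import numpy as np

def f(X, args):
    # Hypothetical stand-in: correlate two columns of X
    # and return a single float64.
    a1, a2 = args
    return np.corrcoef(X[:, a1], X[:, a2])[0, 1]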

The most promising lead I have so far is this example, but it uses a single-dimensional array. It also uses the array for writing rather than just reading, but it was still relevant in terms of how shared arrays are handled.

I tried to modify that example by changing a = mp.Array('i', [0]*10) to a = mp.Array('i', [[0]*10]*10), but I got the following traceback:

>>> a = mp.Array('i', [[0]*10]*10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/context.py", line 141, in Array
    return Array(typecode_or_type, size_or_initializer, lock=lock,
  File "/usr/lib/python3.8/multiprocessing/sharedctypes.py", line 88, in Array
    obj = RawArray(typecode_or_type, size_or_initializer)
  File "/usr/lib/python3.8/multiprocessing/sharedctypes.py", line 67, in RawArray
    result.__init__(*size_or_initializer)
TypeError: an integer is required (got type list)

I checked the documentation for multiprocessing.Array (excerpt shown below), which has the argument size_or_initializer. The argument can take either an integer or a sequence. Strictly speaking, [[0]*10]*10 is a sequence, but its elements are lists rather than the integers that the 'i' typecode expects, hence the traceback above.

If size_or_initializer is an integer, then it determines the length of the array, and the array will be initially zeroed. Otherwise, size_or_initializer is a sequence which is used to initialize the array and whose length determines the length of the array.
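To double-check my reading of that, here is a minimal sketch of the two accepted forms (the variable names are mine):

import multiprocessing as mp

a = mp.Array('i', 10)           # integer: a length-10 array, zero-initialized
b = mp.Array('i', [1, 2, 3])    # flat sequence: initialized from its values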

So my main lead of using multiprocessing.Array is probably tanked. How can I build a 2D array that can be accessed (using Pool) for read-only operations?

Galen
  • Related (I still need to test this to see if it matches my use case): https://stackoverflow.com/questions/54320605/how-do-i-use-multiprocessing-pool-on-an-array – Galen Jul 16 '21 at 15:33

1 Answer


Apparently setting a global variable is effective at making the array accessible to each process, which can be set up by passing an initialization function to Pool's initializer parameter. I was concerned about passing a copy of the array to each process because it would require considerably more memory. Once the shared array is made accessible, the usual NumPy operations can be performed on it as desired.

Here is a constructed example that selects pairs of columns from an array (treated as a data matrix) and calculates the correlation between each pair.

from itertools import combinations
from multiprocessing import Pool

import numpy as np
from scipy.stats import pearsonr

X = np.random.random(100000*10).reshape((100000, 10))

def function(cols):
    # Select the pair of columns and compute their correlation.
    x, y = X[:, cols[0]], X[:, cols[1]]
    return pearsonr(x, y)

def init(shared_X):
    # Make the data matrix available as a global in each worker process.
    global X
    X = shared_X

if __name__ == '__main__':
    with Pool(initializer=init, initargs=(X,), processes=4) as P:
        print(P.map(function, combinations(range(X.shape[1]), 2)))

This example illustrates how pairwise operations on random variables can be parallelized for scalability. Some advice for those who would like to calculate something other than correlation: replace scipy.stats.pearsonr with another function that takes two arrays as arguments, and this code will compute that function for all pairs of columns. Naturally, replace X = np.random.random(100000*10).reshape((100000, 10)) with code that appropriately loads your data into a 2D array representing a data matrix.
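For instance, here is a minimal sketch of such a swap, assuming scipy.stats.spearmanr as the replacement statistic (everything else in the script stays the same):

from scipy.stats import spearmanr

def function(cols):
    # Same pattern as above, but with a different pairwise statistic.
    x, y = X[:, cols[0]], X[:, cols[1]]
    return spearmanr(x, y)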

Galen