
I have a very large diagonal matrix that I need to split for parallel computation. Due to data locality issues it makes no sense to iterate through the matrix and hand every n-th calculation to one of n threads. Currently I am dividing the k x k diagonal matrix in the following way, but it yields partitions that are unequal in terms of the number of calculations they contain (the smallest piece takes a few times longer to compute than the largest).

def split_matrix(k, n):
    split_points = [round(i * k / n) for i in range(n + 1)] 
    split_ranges = [(split_points[i], split_points[i + 1],) for i in range(len(split_points) - 1)]
    return split_ranges

import numpy as np
k = 100
arr = np.zeros((k,k,))
idx = 0
# fill the strict upper triangle with consecutive index values
for i in range(k):
    for j in range(i + 1, k):
        arr[i, j] = idx
        idx += 1


def parallel_calc(array, k, si, endi):
    for i in range(si, endi):
        for j in range(k):
            pass  # do some expensive calculations here

cpu_cnt = 4  # placeholder: number of worker processes / CPU cores

for start_i, stop_i in split_matrix(k, cpu_cnt):
    parallel_calc(arr, k, start_i, stop_i)
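
For illustration (not part of the original question code): the imbalance comes from the triangular fill, since with equal-height row ranges the earlier ranges contain far more upper-triangle elements than the later ones. A minimal check, reusing split_matrix and the cpu_cnt placeholder above:

# count the upper-triangle elements that fall into each row range
for start_i, stop_i in split_matrix(k, cpu_cnt):
    elems = sum(k - i - 1 for i in range(start_i, stop_i))
    print((start_i, stop_i), elems)
# for k = 100 and cpu_cnt = 4 this prints 2175, 1550, 925 and 300 elements,
# i.e. roughly a 7x spread in work between the first and the last partition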

Do you have any suggestions as to the implementation or library function?

sophros
  • With all due respect, **what tools are used** to make the CPython `parallel_calc()` function inside a `[SEQ]`-`for`-loop get executed in a true-`[PARALLEL]` fashion? + What are the actual scales ~ could you translate the term "very large" into scientific notation `1E+XYZ` and confirm the actual desired datatype ~ `{ uint8 | ... |` **`float64`** `| complex128 }`? – user3666197 Sep 15 '17 at 12:10
  • `arr` is float32, in range of _at least_ 10**10 and `np.memmap`-ped. `parallel_calc` is run in pool of processes from `multiprocessing`. – sophros Sep 15 '17 at 12:24
  • 1E+10 with a process-pool-fileIO? **How long does the "expensive calculation" take { min | avg | max } [ms] and where does the process output go ( another fileIO or socket or InRAM or ... )?** – user3666197 Sep 15 '17 at 12:28
  • Results get stored in the array. The "expensive calculation" takes in the range of 3-50 ms using also separate per-process instance of additional data. The data used for calculation is in the small 'surroundings' of the item being updated as a result of the calculation. – sophros Sep 15 '17 at 12:35
  • You ask more than you give, man. Noticed the additional ( separate ) data the concurrent processes access -- are these InRAM or, again, fileIO? Are "surroundings" convolutional ( propagating the just updated values for correct computation of any next / surrounding element ) or independent ( a snapshot processing )? How do you plan to process the small-"kernel"-surroundings near the `arr` "split" boundaries, if "kernel"-footprints step beyond the split-edge? Does the `arr` remain diagonal until all the processing terminates or is it subject of any refills into non-diagonal elements during calc? – user3666197 Sep 15 '17 at 13:00
  • Let's assume it is all InRAM, no fileIO. The calculations can be done independently within individual processes and I do not want to propagate between processes - it does very little harm (from this perspective the problem is almost 100% ideally concurrent). The largest and most problematic piece is the array, which I do not want to keep InRAM in one copy per core (1), and actually even 1 copy does not fit in memory (2 - hence the memmapped ndarray). (1) has been resolved by reopening the file with an offset in every process, and (2) I was able to solve now with the code below. Thanks for the help! – sophros Sep 15 '17 at 13:31

2 Answers


After a number of geometrical calculations on the side, I arrived at the following partitioning, which gives roughly the same number of matrix points in each of the vertical (or horizontal, if one prefers) partitions.

import math

def offsets_for_equal_no_elems_diag_matrix(matrix_dims, num_of_partitions):
    if 2 == len(matrix_dims) and matrix_dims[0] == matrix_dims[1]:  # square matrix
        k = matrix_dims[0]
        # An isosceles right triangle with legs of length s has area s**2 / 2; requiring each
        # partition to hold 1/num_of_partitions of the total area k**2 / 2 gives the sqrt() below.
        # The k - ... flips the axis (it is easier to calculate starting from the smallest triangle piece).
        div_points = [0, ] + [round(k * math.sqrt((i + 1) / num_of_partitions)) for i in range(num_of_partitions)]
        pairs = [(k - div_points[i + 1], k - div_points[i], ) for i in range(num_of_partitions - 1, -1, -1)]
        return pairs
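
A quick sanity check of this partitioning (an illustrative sketch, not part of the original answer): count the upper-triangle elements that fall into each returned row range and compare them with the ideal k*(k-1)/2 / num_of_partitions.

k, n = 100, 4
for start_i, stop_i in offsets_for_equal_no_elems_diag_matrix((k, k), n):
    elems = sum(k - i - 1 for i in range(start_i, stop_i))
    print((start_i, stop_i), elems)
# prints 1209, 1256, 1260 and 1225 elements per range, each close to the
# ideal 4950 / 4 = 1237.5, versus 2175 / 1550 / 925 / 300 for equal-height rows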
sophros
  • sorry to note: given the facts posted in the comments above, your processing-approach is devastating the performance and is extending ( instead of masking ) both the fileIO-latencies and the fileIO-bandwidth limits. If your Senior Software Engineering and R&D practice is not to suffer from this approach, rather hire an HPC professional to cut the process-duration and increase the processing performance. **Using a set of several concurrent fileIO-maps, that are just-block-"offset"-`memmap`-s onto the same file is the worst idea to use if trying to indeed increase the processing performance.** – user3666197 Sep 15 '17 at 13:49
  • Thank you for your opinion but I dare to disagree. That would be true if the bottleneck were in fileIO. Otherwise, I am striving to find the easiest solution given the problem at hand. fileIO proved not to be a problem on the hardware I am using, given the sequential processing within processes and the large chunks numpy.memmap uses to access the file. This is good enough for the PoC at hand. For repetitive HPC-level processing I would indeed ask someone with in-depth knowledge in that area for a thorough analysis. – sophros Sep 15 '17 at 13:59
  • Negative, if both of your quantitative details above were correct, the **process** as proposed **takes ~ 347 .. 5787 days** for the said { 3 .. 50 } [ms] per-shot, so even a PoC is not an excuse for such a poor concept. Your problem is solvable, even in spite of your assurances of being "near"-local ( while looping over the full span of `1E+10` in the `k`-inner loop ), from the very beginning ( detailed profiling will show you what the CPU-cores are actually waiting for, and **how much additional multicore-blocking** the O/S + the **`cpu_cnt`** multiples of **`np.memmap()`** IO-ops have injected ) – user3666197 Sep 15 '17 at 15:58
  • I ran a smaller test with an 86765**2 matrix (~ 7.5 * 10^9 elements) and it took 481268 s in total (+33%) vs. 361892 s when additional storage for separate matrices was used. Since I almost ran out of space when using separate matrices, I believe I can accept the trade-off for the time being. This result proves the point that _user3666197_ was making about access to the same file. Using separate memmapped matrices of the sizes calculated with the function above would allow me to maintain the benefit of an almost equal split of workload between CPUs and limit the impact on storage. – sophros Sep 20 '17 at 11:15
  • Additionally, I have to correct the per-calculation time: on average it is very close to 0.5 ms instead of the 3-50 ms observed on smaller examples (likely setup operations weighed on the average). With the splitting function above the differences in calculation time per core were small (-1.18% to +1.36% of the ideal split). – sophros Sep 20 '17 at 11:19

I think you should update your split_matrix method, as it returns one split range less than you want (setting cpu_cnt=4 will return only 3 tuples, and not 4):

def split_matrix(k, n):
    split_points = [round(i * k / n) for i in range(n+1)] 
    return [(split_points[i], split_points[i + 1],) for i in range(len(split_points) - 1)]

Edit: If your data locality constraint is not so strong, you could try this: create a queue of tasks to which you add all indices/entries for which the calculation shall be performed. Then you initialize your parallel workers (e.g. using multiprocessing) and let them start. Each worker now picks an element from the queue, calculates the result, stores it (e.g. in another queue) and continues with the next item, and so on, as in the sketch below.
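
A minimal sketch of that queue-based worker pattern (an illustrative assumption, not code from the question; expensive_calculation is a hypothetical placeholder for the real per-item work):

import multiprocessing as mp

def expensive_calculation(i):
    # hypothetical placeholder for the real per-item work
    return i * i

def worker(task_queue, result_queue):
    # pull indices until the sentinel (None) appears, compute, push results
    for i in iter(task_queue.get, None):
        result_queue.put((i, expensive_calculation(i)))

if __name__ == "__main__":
    k = 100
    tasks, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(tasks, results))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    for i in range(k):      # one task per index/entry
        tasks.put(i)
    for _ in workers:       # one sentinel per worker so each one terminates
        tasks.put(None)
    collected = [results.get() for _ in range(k)]
    for w in workers:
        w.join()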

If this is not working for your data, I don't think you can improve it any further.

zimmerrol