So here are my observations after a gruelling bout with the task:
- The final array you're supposed to obtain as output is 4-D, has the shape `(2000, 2000, 76, 76)`, and holds `float64` values. A crude calculation shows that its size is 2000·2000·76·76·8 bytes ≈ 185 GB (about 172 GiB)... so you definitely cannot hold all of it in memory at once. (A quick sanity check of this figure follows the list.)
- Usage of `multiprocessing` is complicated (it always has been for someone who has not studied multiprocessing rigorously), and it yields not-so-good computation times. For instance, on Google Colab (Tesla T4 GPU backend, 12 GB RAM), `N = 50` takes ~4.5 s (minimum) to run. A better implementation with the module may be possible, but I'm not the one for it.
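As a quick sanity check of that size estimate, the byte count can be computed without allocating anything:

    import numpy as np

    # Size of a float64 array of shape (2000, 2000, 76, 76), computed without allocating it
    shape = (2000, 2000, 76, 76)
    n_bytes = np.prod(shape, dtype=np.int64) * np.dtype(np.float64).itemsize
    print(n_bytes / 1e9, "GB")     # ~184.8 GB
    print(n_bytes / 2**30, "GiB")  # ~172.1 GiB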
My course of action:
To tackle the second issue, I use `cupy`, which is meant to be a drop-in replacement for `numpy` in Python. By drop-in replacement, I mean you can replace `numpy` with `cupy` everywhere in your code (there are exceptions, but they are irrelevant to this problem). `cupy`, however, uses CUDA on Nvidia GPUs, so you need a CUDA installation before installing `cupy`. (Check this guide.) Or, if it is possible, you may prefer online computing resources, as I do with Google Colab.
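As a minimal sketch of what "drop-in" means here (toy arrays, nothing to do with the actual data):

    import numpy as np
    import cupy as cp

    x_cpu = np.arange(6, dtype=np.float64).reshape(2, 3)
    y_cpu = x_cpu * 2.0             # plain numpy, runs on the CPU

    x_gpu = cp.asarray(x_cpu)       # move the data to the GPU
    y_gpu = x_gpu * 2.0             # same expression, now executed by cupy on the GPU

    assert np.allclose(cp.asnumpy(y_gpu), y_cpu)   # copy back to host and compare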
I also divide the work into parts. I use a function `fnh(a, scale, N)` to compute the scaled and tiled arrays for arbitrary `N`. I slice the intended output array into multiple parts and iteratively run `fnh(...)` on these slices. The slicing can be tuned for better optimisation, but I've just used something based on a crude guess.
Here is the code:
    import cupy as cp

    def fnh(a, scale, N):
        # Scale each (76, 76) array by its factor, then tile the stack N times
        arr = cp.einsum('i,ijk->ijk', scale, a)
        result = cp.tile(arr, (N, 1, 1, 1))   # shape: (N, len(scale), 76, 76)
        del arr
        return result

    def slicer(arr, scales, N=400):
        mempool = cp.get_default_memory_pool()
        pinned_mempool = cp.get_default_pinned_memory_pool()

        # result = np.empty((N, 2000, 76, 76))   # too large to be allocated

        section = 500                 # Choices subject
        parts = 80                    # to optimization
        step = N // parts

        for i in range(parts):                       # Slice N into equal parts
            begin = i * step
            end = begin + step
            stacked = cp.empty((step, 2000, 76, 76))

            for j in range(2000 // section):         # Section the 2000 arrays into equal parts
                sec_begin = j * section
                sec_end = sec_begin + section
                s = scales[sec_begin:sec_end]
                a = arr[sec_begin:sec_end]
                res = fnh(a, s, step)
                stacked[:, sec_begin:sec_end] = res  # Accumulate values
                del a, res

            # result[begin:end] = stacked            # This is where we were supposed to
            #                                        # accumulate values in result
            del stacked
            mempool.free_all_blocks()
            pinned_mempool.free_all_blocks()
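For reference, a possible way to drive `slicer` (the inputs here are random placeholders standing in for the real 2000 arrays and their scale factors):

    # Hypothetical driver: 2000 random (76, 76) arrays and 2000 scale factors, already on the GPU
    arr = cp.random.rand(2000, 76, 76)
    scales = cp.random.rand(2000)

    slicer(arr, scales, N=2000)   # processes the output in (parts x section) chunks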
First of all, I use `cupy.einsum` to compute the scaling vectorially on the arrays.
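In case the subscripts look opaque: `'i,ijk->ijk'` simply multiplies the i-th `(76, 76)` array by the i-th scale factor, i.e. it is equivalent to broadcasting:

    import cupy as cp

    a = cp.random.rand(5, 76, 76)
    scale = cp.random.rand(5)

    via_einsum = cp.einsum('i,ijk->ijk', scale, a)
    via_broadcast = scale[:, None, None] * a        # same result via broadcasting

    assert cp.allclose(via_einsum, via_broadcast)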
Second, I delete arrays wherever possible to recover space. In particular, it is imperative to deallocate the space allotted by `cupy` in the GPU memory pool using `mempool.free_all_blocks()` and `pinned_mempool.free_all_blocks()`, to recover available GPU memory. Read about it here. However, since `cupy` caches allocated memory, it might(?) be helpful to exploit this caching in a limited manner for some speedup. (This is a hunch and I'm not particularly informed about it.) So I reuse the same pool memory for the sectioned tiles and clear it only after completing an N-slice.
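If you want to watch the pool being released, the default memory pool exposes `used_bytes()` and `total_bytes()` counters; a small sketch, separate from the main code:

    import cupy as cp

    mempool = cp.get_default_memory_pool()

    x = cp.empty((1000, 76, 76))                     # grab some GPU memory (~46 MB)
    print(mempool.used_bytes(), mempool.total_bytes())

    del x                                            # the block returns to the pool...
    mempool.free_all_blocks()                        # ...and this hands it back to the GPU
    print(mempool.used_bytes(), mempool.total_bytes())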
Third, where `# result[begin:end] = stacked` is, you're supposed to off-load the array somehow, because you cannot afford to have the entire array in memory, as previously mentioned. Off-loading each finished slice to some bin, wherever and however you see fit for your application, is probably a good way to avoid the memory problem.
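One possible bin (just a sketch with an assumed file name, not part of the code above) is a memory-mapped `.npy` file on disk, written one N-slice at a time; it needs ~185 GB of free disk space:

    import numpy as np
    import cupy as cp

    N = 2000
    # Hypothetical on-disk bin: a memory-mapped .npy file holding the full result
    result = np.lib.format.open_memmap('result.npy', mode='w+',
                                       dtype=np.float64, shape=(N, 2000, 76, 76))

    # Inside slicer, instead of the commented-out line, each finished N-slice would be
    # copied back to the host and written to its part of the file:
    # result[begin:end] = cp.asnumpy(stacked)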
Fourth, this code is incomplete. This is because the formed arrays need proper handling, as previously mentioned. But it does the major heavy-lifting.
Lastly, I timed this code using `timeit` in Google Colab.
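A sketch of the timing cell, assuming `arr` and `scales` are set up as in the driver above and the `%timeit` IPython magic is available:

    # GPU work is asynchronous; for stricter timing you may want to call
    # cp.cuda.Device().synchronize() at the end of slicer.
    %timeit -n 1 -r 5 slicer(arr, scales, N=2000)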
For comparison, `N = 50` takes ~50 ms (minimum) and `N = 2000` takes ~7.4 s (minimum) to run.
UPDATE: Changing to `parts = 40` and `section = 250` brings the minimum time down to ~6.1 s.
Well, I'm sure there are better ways to write this code, and I'm looking forward to seeing them!