Context
I am trying to load multiple .npy files containing 2D arrays into one big 2D array so that I can process it chunk by chunk later.
All of this data is bigger than my RAM, so I am relying on memmapped loading:
import os
import glob
import numpy as np
import dask.array as da

pattern = os.path.join(FROM_DIR, '*.npy')
paths = sorted(glob.glob(pattern))
arrays = [np.load(path, mmap_mode='r') for path in paths]  # memmaps, nothing read into RAM yet
array = da.concatenate(arrays, axis=0)                     # lazy dask array over all files
No problem so far, RAM usage is very low.
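For reference, a quick sanity check like the following (metadata lookups only, nothing is computed or read from disk) confirms that the concatenated array is still lazy at this point:
print(array.shape, array.dtype)   # combined shape / dtype across all files
print(array.nbytes / 1e9, "GB")   # logical size, much larger than my available RAM
print(array.chunks)               # chunk layout dask assigned to the memmapped pieces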
Problem
Now that I have my big 2D array, I am looping through it to process the data chunk by chunk, like so:
chunk_size = 100_000
for i in range(0, 1_000_000, chunk_size):
    subset = np.array(array[i:i + chunk_size])  # materialize the chunk as a plain ndarray
    # Process data [...]
    del subset
But even if I execute this block of code without doing any processing at all, the subsets seem to accumulate in RAM and are never released.
It is as if Dask were loading, or copying, the memmapped arrays into real in-memory np.arrays behind the scenes. Deleting the variable or calling gc.collect() did not solve this.
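Here is a minimal sketch of how I observe the growth (psutil is used here only to read the process RSS for illustration; it is not part of my actual pipeline):
import gc
import psutil

proc = psutil.Process()
chunk_size = 100_000
for i in range(0, 1_000_000, chunk_size):
    subset = np.array(array[i:i + chunk_size])  # materialize one chunk
    del subset                                  # no processing at all
    gc.collect()
    # RSS keeps climbing across iterations instead of staying flat
    print(f"rows {i}-{i + chunk_size}: RSS = {proc.memory_info().rss / 1e9:.2f} GB")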