I have a large (75000 x 5 x 6000) 3D array stored as a NumPy memory map. If I simply iterate over the first dimension like so:
import numpy as np
import time
a = np.memmap(r"S:\bin\Preprocessed\mtb.dat", dtype='float32', mode='r', shape=(75000, 5, 6000))
l = []
start = time.time()
index = np.arange(75000)
np.random.shuffle(index)
for i in np.array(index):
    l.append(np.array(a[i]) * 0.7)
print(time.time() - start)
>>> 0.503
The iteration completes very quickly. However, when I iterate over the same memmap within a larger program, individual reads from the memmap take as much as 0.1 seconds each, and pulling all 75000 records takes nearly 10 minutes.
The larger program is too long to reproduce here, so my question is: are there any known issues that can cause memmap access to slow down significantly, perhaps if there is a significant amount of data being held in Python memory?
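To test that hypothesis in isolation, this is the kind of experiment I have in mind: hold a large "ballast" array in ordinary memory, then re-run the same random-access benchmark against the memmap. The ballast size below is arbitrary (roughly 0.5 GB), just to stand in for the memory footprint of the larger program:

import numpy as np
import time

# Hold a large block of ordinary NumPy data in memory (arbitrary size, ~0.5 GB)
ballast = np.ones((4000, 5, 6000), dtype='float32')

a = np.memmap(r"S:\bin\Preprocessed\mtb.dat", dtype='float32', mode='r', shape=(75000, 5, 6000))
index = np.arange(75000)
np.random.shuffle(index)

start = time.time()
for i in index:
    _ = np.array(a[i]) * 0.7
print("with ballast held in memory:", time.time() - start)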
In the larger program, the usage looks like this:
import time
import numpy as np

array = np.memmap(self.path, dtype='float32', mode='r', shape=self.shape)
for i, (scenario_id, area) in enumerate(self.scenario_areas):
    address = scenario_matrix.lookup.get(scenario_id)
    if address:
        scenario_output = array[address]  # pull a single record from the memmap
        output_total = scenario_output * float(area)
        cumulative += output_total  # add results to cumulative total
        contributions[int(scenario_id.split("cdl")[1])] = output_total[:2].sum()
del array
This second example takes more than 10 minutes to execute. Timing the line scenario_output = array[address], which simply pulls one record from the memmap, shows it varying between 0.0 and 0.5 seconds, i.e. up to half a second to pull a single record.
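For reference, this is a minimal sketch of how the per-read timings can be collected inside the loop; the timed helper and the 0.1-second reporting threshold are just illustrative choices, not part of the real program:

import time
from contextlib import contextmanager

@contextmanager
def timed(label, log, threshold=0.1):
    # Record the wall-clock time of the wrapped block and print it if it is slow
    t0 = time.perf_counter()
    yield
    elapsed = time.perf_counter() - t0
    log.append(elapsed)
    if elapsed > threshold:
        print("slow read %s: %.3fs" % (label, elapsed))

# Used around the memmap read in the loop above:
#     with timed(scenario_id, read_times):
#         scenario_output = array[address]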