
I have a large (75000 x 5 x 6000) 3D array stored as a NumPy memory map. If I simply iterate over the first dimension like so:

import numpy as np
import time

# Open the large binary file as a read-only memory map
a = np.memmap(r"S:\bin\Preprocessed\mtb.dat", dtype='float32', mode='r', shape=(75000, 5, 6000))
l = []
start = time.time()
index = np.arange(75000)
np.random.shuffle(index)  # visit the records in random order
for i in index:
    l.append(np.array(a[i]) * 0.7)  # copy the record out of the memmap and scale it
print(time.time() - start)

>>> 0.503

The iteration takes place very quickly. However, when I attempt to iterate over the same memmap in the context of a larger program, individual calls to the memmap take as much as 0.1 seconds each, and pulling all 75000 records takes nearly 10 minutes.

The larger program is too long to reproduce here, so my question is: are there any known issues that can cause memmap access to slow down significantly, perhaps if there is a significant amount of data being held in Python memory?

In the larger program, the usage looks like this:

import time

# 'array' is the same kind of read-only memmap; self.path and self.shape come from the surrounding class
array = np.memmap(self.path, dtype='float32', mode='r', shape=self.shape)
for i, (scenario_id, area) in enumerate(self.scenario_areas):
    address = scenario_matrix.lookup.get(scenario_id)  # integer row number in the memmap, or None
    if address:
        scenario_output = array[address]  # pull a single record from the memmap
        output_total = scenario_output * float(area)
        cumulative += output_total  # add results to cumulative total
        contributions[int(scenario_id.split("cdl")[1])] = output_total[:2].sum()
del array

The second example takes more than 10 minutes to execute. Timing the line scenario_output = array[address], which simply pulls one record from the memmap, shows values between 0.0 and 0.5 seconds - up to half a second to pull a single record.
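A minimal sketch of how that single line can be timed inside the loop (the pull_times list is purely illustrative and not part of the actual program):

import time

# Sketch: time only the memmap pull on each iteration
# (pull_times is illustrative; the rest of the loop body is unchanged)
pull_times = []
for i, (scenario_id, area) in enumerate(self.scenario_areas):
    address = scenario_matrix.lookup.get(scenario_id)
    if address:
        t0 = time.time()
        scenario_output = array[address]
        pull_times.append(time.time() - t0)
        # ... rest of the loop body unchanged ...
print(max(pull_times), sum(pull_times))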

triphook
  • do you need to write data back to the array immediately? `r+` may be slowing you down. – Aaron Mar 17 '17 at 18:05
  • I'm assuming your file is around 9GB? (8.4 GiB) 10 min with added processing and writeback isn't too horrible (unless you're using an SSD) – Aaron Mar 17 '17 at 18:07
  • The known issue is that unless the whole things fits in memory, you will be swapping/doing io. This can be extra-slow if your access is non-sequential in the storage order. – pvg Mar 17 '17 at 18:09
  • But why can I iterate over the entire block in a fraction of a second in the example I gave? What's different about running the same routine in the context of a larger program? – triphook Mar 17 '17 at 18:14
  • Your example loop is fast because you're not actually doing anything with the data, even reading it. Of course doing actual work is going to be slow. It's 9 gigabytes. – user2357112 Mar 17 '17 at 18:14
  • The difference is you're presumably accessing all of your data. In your short example, you're accessing some of it, sequentially. What is the pattern of access in your larger program? – pvg Mar 17 '17 at 18:16
  • @pvg: The example isn't even accessing *any* of the data, just taking views! Taking a new view of the data doesn't require reading any of it. – user2357112 Mar 17 '17 at 18:22
  • It's not possible to tell from what you've added if the access is sequential. If the access is not sequential, this can get very slow. If you have few gigs of data (that don't necessarily fit in memory, but even when they do), you're best off processing it sequentially. So if you can drive your processing loop from the big blob of data, instead of looking random things up in it, performance will be significantly better. – pvg Mar 17 '17 at 18:27
  • See edit: I can still make 75000 random pulls in under 5 seconds outside the larger script. – triphook Mar 17 '17 at 18:34
  • What is `address`? Is it always a scalar? – user2357112 Mar 17 '17 at 18:37
  • Yes. scenario_matrix.lookup is a dictionary that returns an integer row number corresponding to a record in the memmap. – triphook Mar 17 '17 at 18:38
  • And are the sequential values of `address` in order, or scattered all over? random disk search time could be hurting speed. – hpaulj Mar 17 '17 at 19:44
  • Can you change the inner loop of your test cases to do something like a * 0.7 like your actual program? Also, you can combine all of this stuff to make your question readable, no need to add a manual edit history, one is kept anyway. – pvg Mar 17 '17 at 23:15
  • You're not printing this timing data to the console on every access by any chance, are you? – pvg Mar 20 '17 at 14:05
  • Not in the first example. I am in the second, only to demonstrate that the timing on that particular line can be as high as a half second. If I remove the print statement there the execution of the entire loop is still orders of magnitude higher than the first example. – triphook Mar 20 '17 at 14:09
  • Ok but what sort of orders of magnitude? Also, what kind of device is the file on, how much physical memory on the machine and how many accesses does your larger program actually perform? – pvg Mar 20 '17 at 14:10
  • You could use `line_profiler` and `profile` to find out where the bottlenecks are, then try to eliminate those. Repeat that until you're satisfied. – MSeifert Mar 20 '17 at 14:11
  • @pvg Sorry, I removed that detail when I cleaned up the edits. The second script takes more than 10 minutes to complete. The pull itself **scenario_output = array[address]** times to zero in the first example but to as much as 0.5 seconds in the second, even though it's making the same pull from the same array. – triphook Mar 20 '17 at 14:14
  • The file is on my internal hard drive. The file is 16.1 GB. The larger script isn't making any other accesses to this particular file although it does pull from many other memory-mapped files. – triphook Mar 20 '17 at 14:16
  • 'although it does pull from many other memory-mapped files' well. That might be a thing. If you're filling up memory and file cache with other things, you're more likely to eat hits from swapping. I assume your internal drive is an hdd. What's the actual memory footprint of your script compared to your total memory? And how many accesses are you making to the big file, it's not possible to tell from the code you've provided. – pvg Mar 20 '17 at 14:19
  • The process is using 2.5 GB of 32 GB available memory. By 'accesses' do you mean how many times it's opened using the np.memmap command or how many pulls to the opened array? The former is dozens, the latter is thousands. What's strange is that occasionally, the larger script will run very quickly (at the same rate as the first example), so I assume it's related somehow to memory or other overhead issues, but it still seems like I have plenty of resources available. – triphook Mar 20 '17 at 14:31
  • By accesses I mean how many times you access the actual data, not how many times you mmap the thing. Thousands is what, 2000? 70000? It's not that strange the script runs quickly sometimes because essentially what happens is it gets cached in memory. It should run much faster the second time you run it if you run it twice in a row. – pvg Mar 20 '17 at 14:43
  • One thing worth trying for more consistent numbers is getting https://wj32.org/wp/software/empty-standby-list/ and then running `EmptyStandbyList.exe standbylist` and then measuring time, say, for your simple script. It shouldn't be 0.5 seconds in that case. – pvg Mar 20 '17 at 15:02
  • can you try to reproduce it on Linux? Windows implementation of mmap is different and sometimes causes issues like [this](http://bugs.python.org/issue16743). – Marat Mar 24 '17 at 16:30

2 Answers


To the best of my knowledge, there are no restrictions attributable to memmaps in Python that would be independent of general OS-level restrictions. So I would guess you either have an OS-level memory bottleneck (possibly interactions between the caching of different large mmaps) or your problem lies somewhere else.

It is very good that you already have a reference implementation that shows how fast the operation should be. You will need to test the possible causes systematically. Here are some directions that can help identify the cause.

First, use cProfile both on the reference implementation and on the full program to better understand where the bottleneck is (a rough profiling sketch follows after the list below). You will get a list of the function calls and the time spent in each function. This could lead to unexpected results. Some guesses:

  1. Is it true that most of the time is spent inside the piece of code you have posted? If not, the profiling could point in another direction.
  2. Is self.scenario_areas list-like, or is it an iterator that does some hidden and expensive calculations?
  3. It could be that the lookup scenario_matrix.lookup.get(scenario_id) is slow. Check it.
  4. Is contributions a regular Python list or dict, or does it do anything strange on assignment behind the scenes?
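
A rough sketch of the profiling step (obj.process_scenarios() is just a placeholder for whatever method contains the loop you posted):

import cProfile
import pstats

# Sketch: profile the whole processing step and list the most expensive calls.
# obj.process_scenarios() is a placeholder for the method containing the loop.
profiler = cProfile.Profile()
profiler.enable()
obj.process_scenarios()
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time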

Only if you have verified that the time is in fact spent in the line scenario_output = array[address] would I start to hypothesize about interactions between mmap files. If this is the case, start to comment out parts of the code that involve other memory accesses and profile the code repeatedly to gain a better understanding of what happens.
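For example, you could replay just the memmap pulls on their own, with everything else stripped away (a sketch only; addresses stands in for the row numbers your lookup actually produces):

import time
import numpy as np

# Sketch: replay only the memmap pulls, with the rest of the program stripped away.
# 'addresses' is a placeholder for the row numbers returned by scenario_matrix.lookup.
array = np.memmap(self.path, dtype='float32', mode='r', shape=self.shape)
start = time.time()
for address in addresses:
    record = np.array(array[address])  # np.array() forces an actual read, not just a view
print(time.time() - start)
del array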

I hope this helps.

Peter

You probably won't be able to avoid performance issues using np.memmap.

I suggest trying something like https://turi.com/products/create/docs/generated/graphlab.SFrame.html

SFrame/SArray let you read tabular data right from disk, which will often be faster for large data files.

It is open source and available at https://github.com/turi-code/SFrame
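
A minimal sketch of what that could look like with the standalone sframe package (the file name and column names are made up for illustration):

import sframe

# Sketch: load tabular data straight from disk rather than holding it all in RAM.
# 'scenarios.csv' and the column names are purely illustrative.
sf = sframe.SFrame.read_csv('scenarios.csv')
subset = sf[sf['scenario_id'] == 'cdl42']  # filter rows without loading the whole file into memory
print(subset.head())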

Stanley Kirdey