
I have a memmap to a very large (10-100 GB) file containing current and voltage data. From a given starting index, I want to find the index of the next point for which the voltage satisfies a given condition.

In the case of a relatively small list I could do this with an iterator like so:

import numpy as np

filename = '[redacted]'
# each record is two big-endian 8-byte floats: (current, voltage)
columntypes = np.dtype([('current', '>f8'), ('voltage', '>f8')])
data = np.memmap(filename, dtype=columntypes)
current = data['current']
voltage = data['voltage']

condition = (i for i, v in enumerate(voltage) if v > 0.1)
print(next(condition))

but because my memmap is so large, stepping through it element by element like this is impractical. Is there a Pythonic way to do this without actually loading the data into memory? I can always take the ugly approach of reading chunks of data and looping through them until I find the index I need, but this seems inelegant.
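
For concreteness, here is a minimal sketch of what I mean by the chunked approach (the function name, chunk size, and 0.1 threshold are just placeholders): slice the memmap one window at a time, so only that window gets read, and use a vectorized comparison to find the first hit.

import numpy as np

def next_index(voltage, start, threshold=0.1, chunksize=1_000_000):
    """Return the absolute index of the first voltage > threshold at or after start."""
    for lo in range(start, len(voltage), chunksize):
        chunk = voltage[lo:lo + chunksize]        # slicing the memmap touches only this window
        hits = np.flatnonzero(chunk > threshold)  # in-chunk indices satisfying the condition
        if hits.size:
            return lo + hits[0]                   # convert back to an absolute index
    return None                                   # no matching point found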

  • Intuitively there is no other way – the desired element could just as well be at the very end of the data. The complexity is O(n), and the only way around it is to create an index first, so you can identify the desired chunk right away. – BartoszKP May 08 '18 at 20:41
  • This seems like a misunderstanding of `mmap`. `mmap` allows you to read small sections of a file on demand, as they are accessed. But unless you explicitly unload sections of the file which have been read already, then any operations (like filtering) which need to touch the whole file by definition will always end up loading the whole file in memory. And to periodically unload sections would be just as much work, probably more work, than writing some helper functions to read in chunks. – ely May 08 '18 at 20:44
  • Thanks for the advice, it makes sense. – KBriggs May 08 '18 at 20:45

1 Answer


If the file has formatting in the form of line breaks (like a space- or newline-delimited .csv), you can read and process it line by line:

with open("foo.bar") as f:
    for line in f:
        do_something(line)

Processing the file in chunks doesn't have to be ugly either; you can use something like:

with open("foo.bar") as f:
    for chunk in iter(lambda: f.read(128), ""):
        do_something(chunk)

In your case, since you know the size of each record (a current/voltage pair), you can read each chunk in as raw bytes and then test your condition on the raw data.

sizeDataPoint = 16  # one record = two big-endian 8-byte floats ('>f8' current + voltage)

index = 0
lastIndex = None

with open("foo.bar", "rb") as f:                            # binary mode for raw records
    for chunk in iter(lambda: f.read(sizeDataPoint), b""):  # b"" sentinel marks end of file
        if check_conditions(chunk):
            lastIndex = index
        index += 1
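
check_conditions is left up to you; a possible sketch, assuming each chunk is one 16-byte record in the '>f8' layout from the question (0.1 is just the threshold from the question), is to unpack the raw bytes with struct and test the voltage field:

import struct

def check_conditions(chunk, threshold=0.1):
    if len(chunk) < 16:                 # guard against a short read at end of file
        return False
    # '>dd' = two big-endian doubles, matching the ('>f8', '>f8') dtype
    current, voltage = struct.unpack('>dd', chunk)
    return voltage > threshold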

If it needs to be memory mapped, I'm not 100% sure about numpy's memmap, but I remember using Python's built-in mmap module (a long time ago) to handle very large files. If I remember correctly, it does this through an OS mechanism called paging.

How well this works will depend on whether your OS supports it and how efficiently it can reclaim already-read pages as you iterate through the file, but I think in theory mmap makes it possible to work with files larger than the memory available to Python.
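
As a rough sketch (my own illustration, reusing the placeholder file name above and the 16-byte record layout from the question), you could map the file read-only with mmap and scan fixed-size records without reading the whole file:

import mmap
import struct

RECORD = struct.Struct('>dd')   # one (current, voltage) record: two big-endian doubles, 16 bytes

with open("foo.bar", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(len(mm) // RECORD.size):
            current, voltage = RECORD.unpack_from(mm, i * RECORD.size)
            if voltage > 0.1:
                print(i)            # index of the first matching record
                break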

EDIT: Also, mmap-ing large files won't work unless you're on a 64-bit OS, since the file is mapped into the process's address space.

jmkmay