
I want to create a numpy array from a binary file using np.fromfile. The file contains a 3D array, and I'm only concerned with a certain cell in each frame.

x = np.fromfile(file, dtype='int32', count=width*height*frames)
vals = x[5::width*height]

The code above would work in theory, but my file is very large and reading it all into x causes memory errors. Is there a way to use fromfile to only get vals to begin with?

threnna
  • If you pass a file, not a string for the first parameter, you can simply use the `count` keyword to read the file in manageable chunks. – Paul Panzer Feb 23 '17 at 20:20
  • Count lets you read the first N elements, but it won't help you load every nth element. Files are serial storage. Reading every nth item to the end still requires reading the file to the end. – hpaulj Feb 23 '17 at 20:22
  • @hpaulj yes, but on the smaller chunks OP can just use their posted code. If the decimated result fits in memory I don't see why this shouldn't work. Or am I missing something here? – Paul Panzer Feb 23 '17 at 20:29
  • `fromfile` doesn't have any further parameters. If `x` is too big to fit, then he can't select `vals`. How about a memory map on the file? I don't know whether the `tofile` format is compatible with that or not. Maybe the `np.save`/`np.load` pair would be better and more flexible. – hpaulj Feb 23 '17 at 20:43
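Following up on hpaulj's memmap suggestion: `np.memmap` presents the raw binary file as an array without reading it into memory, so slicing pulls in only the requested elements. A minimal sketch (the file name and dimensions below are made up for the demo; raw `tofile` output has no header, so it maps directly):

```python
import numpy as np

# Write a small sample file in the same raw int32 format as the question.
width, height, frames = 4, 5, 10
data = np.arange(width * height * frames, dtype='int32')
data.tofile('demo.bin')

# Map the file instead of loading it; no data is read yet.
mm = np.memmap('demo.bin', dtype='int32', mode='r')

# Slicing the memmap touches only the selected elements on disk;
# np.array() copies just those into an ordinary in-memory array.
vals = np.array(mm[5::width * height])
print(vals)
```

Note that `tofile` writes no shape or dtype metadata, so the dtype must be supplied by hand; `np.save`/`np.load(..., mmap_mode='r')` would carry that metadata along.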

1 Answer


This may be horribly inefficient but it works:

import numpy as np

def read_in_chunks(fn, offset, step, steps_per_chunk, dtype=np.int32):
    """Collect element offset, offset+step, ... without loading the whole file."""
    out = []
    with open(fn, 'rb') as fd:
        while True:
            # Each chunk's length is a multiple of step, so the
            # [offset::step] slice stays aligned across chunk boundaries.
            chunk = np.fromfile(fd, dtype=dtype,
                                count=steps_per_chunk * step)[offset::step]
            if chunk.size == 0:
                break
            out.append(chunk)
    return np.concatenate(out)

x = np.arange(100000)
x.tofile('test.bin')
b = read_in_chunks('test.bin', 2, 100, 6, x.dtype)  # dtype must match what was written
print(b)

Update:

Here's one that uses seek to skip over the unwanted data. It works for me, but is largely untested.

def skip_load(fn, offset, step, dtype=np.float64, n=10**100):
    # Work in bytes: scale the element offset and stride by the item size.
    elsize = np.dtype(dtype).itemsize
    step *= elsize
    offset *= elsize
    fd = open(fn, 'rb') if isinstance(fn, str) else fn
    out = []
    pos = fd.tell()
    # First position at or after pos that lies on the grid
    # offset, offset+step, offset+2*step, ...
    target = ((pos - offset - 1) // step + 1) * step + offset
    fd.seek(target)
    while n > 0:
        if fd.tell() != target:
            return np.frombuffer(b"".join(out), dtype=dtype)
        out.append(fd.read(elsize))
        n -= 1
        if len(out[-1]) < elsize:
            # Hit EOF mid-element: drop the partial read.
            return np.frombuffer(b"".join(out[:-1]), dtype=dtype)
        target += step
        fd.seek(target)
    return np.frombuffer(b"".join(out), dtype=dtype)
Paul Panzer