Given a binary file of numerical values, I can read it in using numpy.fromfile(). This allocates a new array for the data. Say I already have an array a and I want to read into this array. I'd have to do something like

import numpy as np

size       = 1_000_000_000
size_chunk = 1_000_000
a = np.empty(size, dtype=np.double)
with open('filename', 'rb') as f:
    tmp = np.fromfile(f, dtype=np.double, count=size_chunk)
a[:size_chunk] = tmp

where, to keep things general, a is larger than the data read into tmp. I want to avoid the memory penalty caused by tmp by reading directly into a. Note that though

a[:size_chunk] = np.fromfile(f, dtype=np.double, count=size_chunk)

hides the tmp variable, the unnecessary temporary memory is still there.

I imagine something like

np.fromfile(f, dtype=np.double, count=size_chunk, into=a[:size_chunk])

though no such into keyword is implemented.

How can I achieve this? I'm open to using SciPy or other Python packages as well. I'll note that the H5Py package has a read_direct() which does what I want, except my data file is a raw binary and not in HDF5 format.

jmd_dk
  • I think you'd have to read the data yourself using `open`, `struct` and assign it to your array in a loop. As you noticed, there is no option to pass an already allocated array to `fromfile`. If memory is such an issue, you'd have to use smaller chunks. – Jan Christoph Terasa Jan 25 '21 at 15:49
  • I'm afraid doing it manually in Python using `open` and `struct` will be quite slow, compared to a NumPy/C implementation. – jmd_dk Jan 25 '21 at 15:53
  • `h5py` has a lot of `cython` code, so that `read_direct` is using lower level array access. – hpaulj Jan 25 '21 at 16:36
  • @jmd_dk Maybe `numba` can speed that up, or you have to extend the `numpy` function yourself (and, if you fancy, try to get the PR merged upstream). – Jan Christoph Terasa Jan 25 '21 at 22:08
  • @JanChristophTerasa Numba cannot help with the many invocations of `struct`. I think the right thing to do is to write it in C/Cython and provide a Python wrapper. I'm just sad that NumPy doesn't allow me to provide the memory buffer, as it already has the underlying efficient read implemented. – jmd_dk Jan 26 '21 at 07:11

1 Answer

I was reading about the buffer protocol and it mentions readinto. There are several questions on SO about this kind of problem, e.g. 1, and some of them suggest using readinto.
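A minimal sketch of the readinto idea, assuming the target slice is C-contiguous and writable (true for a[:size_chunk] of a freshly allocated 1-D array). The helper name read_into, the temporary file, and the small sizes are made up for illustration; only file.readinto(), memoryview and the NumPy calls are real APIs.

```python
import os
import tempfile

import numpy as np


def read_into(f, out):
    """Fill the contiguous array `out` from binary file `f`.

    Hypothetical helper: returns the number of complete items read.
    """
    # Flat, writable byte view onto the array's own memory (no copy).
    view = memoryview(out).cast('B')
    total = 0
    while total < len(view):
        # readinto() writes straight into the buffer it is given.
        n = f.readinto(view[total:])
        if not n:  # EOF reached before the buffer was filled
            break
        total += n
    return total // out.itemsize


# Small sizes so the demo runs quickly; the question uses far larger ones.
size = 1_000
size_chunk = 100
data = np.arange(size_chunk, dtype=np.double)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'data.bin')
    data.tofile(path)  # raw binary file, as in the question

    a = np.zeros(size, dtype=np.double)
    with open(path, 'rb') as f:
        # a[:size_chunk] is a view, so the bytes land directly in a.
        n_read = read_into(f, a[:size_chunk])
```

Because a basic slice of a NumPy array is a view rather than a copy, no temporary array of size_chunk elements is created. A non-contiguous target such as a[::2] would make the memoryview cast raise an error rather than silently copy.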


Original answer: while this should be possible with a custom C extension, it is also overkill.

I don't think this is possible with numpy alone; you would have to write your own C extension. I have looked over the numpy reference, though there might be something I missed, but by design Python allocates the memory for your buffer, and if the numpy developers respect this design choice then there's not much to do other than writing your own C extension to support this very case.

edoput