numpy: efficiently reading a large array

Question

I have a binary file that contains a dense n*m matrix of 32-bit floats. What's the most efficient way to read it into a Fortran-ordered numpy array?

The file is multi-gigabyte in size. I get to control the format, but it must be compact (i.e. about 4*n*m bytes in length) and must be easy to produce from non-Python code.

edit: It is imperative that the method produces a Fortran-ordered matrix directly (due to the size of the data, I can't afford to create a C-ordered matrix and then transform it into a separate Fortran-ordered copy.)

Does http://www.scipy.org/Cookbook/InputOutput answer your question? (See section on "binary files") — nimrodm, Dec 06 '10 at 11:48
@nimrodm Thanks. In fact, I've already been experimenting with some of the methods described there. I am asking the question in the hope that someone would come forward who either has first-hand experience doing what I am trying to do, or is familiar with `numpy` internals and can advise from that angle. — NPE, Dec 06 '10 at 14:06
Generally I've found when reading very large arrays into numpy that I need to know the size in advance, in order to pre-allocate the appropriate array to hold the data. Do you know the size in advance? If not, try using a two-pass approach: first scan to discover size/dimensions of data, then allocate array, then read/parse into that array. — Peter Hansen, Dec 07 '10 at 17:17
@Peter Good point, thanks. I do know the size in advance (I control the data format, so I can write out the size as part of the file header.) — NPE, Dec 07 '10 at 17:26

Sven Marnach · Accepted Answer · 2010-12-06T16:02:27.587

12

NumPy provides fromfile() to read binary data.

a = numpy.fromfile("filename", dtype=numpy.float32)

will create a one-dimensional array containing your data. To access it as a two-dimensional Fortran-ordered n x m matrix, you can reshape it:

a = a.reshape((n, m), order="FORTRAN")

[EDIT: The reshape() actually copies the data in this case (see the comments). To do it without cpoying, use

a = a.reshape((m, n)).T

Thanks to Joe Kingtion for pointing this out.]

But to be honest, if your matrix has several gigabytes, I would go for a HDF5 tool like h5py or PyTables. Both of the tools have FAQ entries comparing the tool to the other one. I generally prefer h5py, though PyTables seems to be more commonly used (and the scopes of both projects are slightly different).

HDF5 files can be written from most programming language used in data analysis. The list of interfaces in the linked Wikipedia article is not complete, for example there is also an R interface. But I actually don't know which language you want to use to write the data...

edited Dec 06 '10 at 16:02

answered Dec 06 '10 at 12:28

Sven Marnach

574,206
118
941
841

@Sven Thanks. What would you say are the advantages of using HDF5 over a simple `n*m` array for storing one large dense matrix of floats? – NPE Dec 06 '10 at 14:03
@aix: to name a few: the ability of transparently having only part of the matrix in memory of a given time, the ability of transparent compression and meta-data taking care of the matrix dimensions, eliminating an important source of erros. And since it is really easy to use, I don't see real disadvantages. – Sven Marnach Dec 06 '10 at 15:20
@Sven Could you also clarify something in your example. If I do `a = numpy.fromfile("filename", dtype=numpy.float32)` followed by `a = a.reshape((n, m), order="FORTRAN")` on a 4GB file, is this potentially going to create a 4GB "C" matrix in memory only to immediately make another 4GB in-memory copy to flip it into the Fortran format? – NPE Dec 06 '10 at 15:24
@aix - Rehshaping doesn't copy the array, it just returns a new view of it. There's no duplication of memory in that case. – Joe Kington Dec 06 '10 at 15:34
@Joe Even flipping between C and Fortran formats (row-wise vs column-wise **storage**)? How's that possible? – NPE Dec 06 '10 at 15:35
@aix - It just changes the striding of the array. It doesn't actually make it Fortran ordered in memory. – Joe Kington Dec 06 '10 at 15:40
@Joe Do you have any links for where I can read up on this? This goes contrary to my intuition and seems to completely negate the point. – NPE Dec 06 '10 at 15:42
@aix - Woops, sorry, I was wrong about that... If you change the order on reshaping, it does create a copy that actually is the order you specify. Generally speaking, though, reshaping and transposing arrays doesn't copy them. – Joe Kington Dec 06 '10 at 15:42
@Joe No problem. I know that changing the shape of a `numpy` matrix doesn't copy the data. However, for performance reasons I do care about how stuff gets stored in memory, and I can't afford to have two copies of the same multi-gigabyte matrix around, hence all my questions. – NPE Dec 06 '10 at 15:46
@aix - For what it's worth, you don't need to specify the order and create the copy. Just do `a = np.fromfile(...)` and then `a = a.reshape((m,n)).T`. This won't create a copy and has exactly the same effect. – Joe Kington Dec 06 '10 at 15:47
@Joe - sorry, you lost me again. As I stated in my question, I am looking for the matrix to be Fortran-ordered. Now, let's say for the sake of argument that `np.fromfile()` gives a `C` matrix (does it? can I change that?) Now, `reshape` changes the striding. I assume that the transpose also works by changing the striding (or am I wrong here?) So the end result is still a `C` matrix, no? – NPE Dec 06 '10 at 15:51
1

@aix - I'm assuming that you're reading in a Fortran ordered array from disk. You get a flat array that is actually a ixj Fortran ordered array. This is the same as a jxi C-ordered array, except that it's transposed. So, we reshape it as jxi, and then transpose it to be ixj. Numpy thinks it's C-ordered jxi being viewed as ixj, but that's equivalent to being directly stored as Fortran ordered ixj in memory. – Joe Kington Dec 06 '10 at 15:57
@Joe - yes, but this goes back to my original question. How do I efficiently read a binary file into a **Fotran** matrix? Are you saying that `np.fromfile()` does that? There's nothing in the documentation, and no parameters that I could find that control this. – NPE Dec 06 '10 at 16:02
@aix - It is in Fortran order in memory... The only difference is numpy's flags... If you were to access the memory buffer of the numpy array from Fortran, you'd read it direcly as a ixj array. `np.fromfile` will read it in in the order that it's stored as on disk. Like I said, I was assuming that it was already stored on disk in Fortran order. If you want to write a C-ordered array to disk as a Fortran ordered array just use `a.ravel('F').tofile(fid)`. – Joe Kington Dec 06 '10 at 16:08
@Joe Thanks for taking the time to explain this. It is now perfectly clear. I get to control the writing of the file, so I can ensure it's written in the desired order. Thanks again. – NPE Dec 06 '10 at 16:10

score 1 · Answer 2 · edited May 23 '17 at 12:07

Basically Numpy stores the arrays as flat vectors. The multiple dimensions are just an illusion created by different views and strides that the Numpy iterator uses.

For a thorough but easy to follow explanation how Numpy internally works, see the excellent chapter 19 on The Beatiful Code book.

At least Numpy array() and reshape() have an argument for C ('C'), Fortran ('F') or preserved order ('A'). Also see the question How to force numpy array order to fortran style?

An example with the default C indexing (row-major order):

>>> a = np.arange(12).reshape(3,4) # <- C order by default
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[1]
array([4, 5, 6, 7])

>>> a.strides
(32, 8)

Indexing using Fortran order (column-major order):

>>> a = np.arange(12).reshape(3,4, order='F')
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
>>> a[1]
array([ 1,  4,  7, 10])

>>> a.strides
(8, 24)

The other view

Also, you can always get the other kind of view using the parameter T of an array:

>>> a = np.arange(12).reshape(3,4, order='C')
>>> a.T
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

>>> a = np.arange(12).reshape(3,4, order='F')
>>> a.T
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

You can also manually set the strides:

>>> a = np.arange(12).reshape(3,4, order='C')
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a.strides
(32, 8)
>>> a.strides = (8, 24)
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

numpy: efficiently reading a large array

2 Answers2

An example with the default C indexing (row-major order):

Indexing using Fortran order (column-major order):

The other view

You can also manually set the strides: