
Is there a way to do the following without an if clause?

I'm reading a set of netCDF files with pupynere and want to build up an array with numpy's append. Sometimes the input data is multi-dimensional (see variable "a" below), sometimes one-dimensional ("b"), but the number of elements per row is always the same (9 in the example below).
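For context, the reading loop looks roughly like this (a minimal sketch, not my actual code; the filenames list and the variable name 'data' are placeholders, and it assumes pupynere's netcdf_file interface):

import numpy as np
from pupynere import netcdf_file

arrays = []
for fname in filenames:                  # placeholder: list of netCDF file paths
    f = netcdf_file(fname, 'r')
    # 'data' is a placeholder for the actual variable name
    arrays.append(np.array(f.variables['data'][:]))
    f.close()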

> import numpy as np
> a = np.arange(27).reshape(3,9)
> b = np.arange(9)
> a.shape
(3, 9)
> b.shape
(9,)

this works as expected:

> np.append(a,a, axis=0)
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
   [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
   [18, 19, 20, 21, 22, 23, 24, 25, 26],
   [ 0,  1,  2,  3,  4,  5,  6,  7,  8],
   [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
   [18, 19, 20, 21, 22, 23, 24, 25, 26]])

but, appending b does not work so elegantly:

> np.append(a,b, axis=0)
ValueError: arrays must have same number of dimensions

The problem with append (from the NumPy manual):

"When axis is specified, values must have the correct shape."

I'd have to reshape first in order to get the right result.

> np.append(a,b.reshape(1,9), axis=0)
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
   [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
   [18, 19, 20, 21, 22, 23, 24, 25, 26],
   [ 0,  1,  2,  3,  4,  5,  6,  7,  8]])

So, in my file reading loop, I'm currently using an if clause like this:

result = np.empty((0, 9), dtype=a.dtype)  # start from an empty (0, 9) array
for i in [a, b]:
    if i.ndim == 2:
        result = np.append(result, i, axis=0)
    else:
        result = np.append(result, i.reshape(1, 9), axis=0)

Is there a way to append "a" and "b" without the if statement?

EDIT: While @Sven answered the original question perfectly (using np.atleast_2d()), he (and others) pointed out that the code is inefficient. In an answer below, I combined their suggestions and replaced my original code. It should be much more efficient now. Thanks.

Sebastian

4 Answers


You can use numpy.atleast_2d():

result = np.append(result, np.atleast_2d(i), axis=0)
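To see why this works: np.atleast_2d() leaves a 2-D array unchanged and promotes a 1-D array to a single-row 2-D array, which is exactly the shape append() expects (a quick check, using a and b from the question):

> np.atleast_2d(b).shape
(1, 9)
> np.atleast_2d(a).shape
(3, 9)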

That said, note that the repeated use of numpy.append() is a very inefficient way to build a NumPy array -- it has to be reallocated in every step. If at all possible, preallocate the array with the desired final size and populate it afterwards using slicing.
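For illustration, if the final number of rows were known up front, the loop could fill a preallocated array instead of appending (a sketch; n_rows and the input list are assumptions, with a and b taken from the question):

n_rows = 10                              # assumed: known final number of rows
result = np.empty((n_rows, 9), dtype=a.dtype)
offset = 0
for i in [a, a, b, a]:                   # stand-in for the file-reading loop
    block = np.atleast_2d(i)
    result[offset:offset + block.shape[0]] = block
    offset += block.shape[0]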

Sven Marnach
  • Thanks for the fast answer. Interesting, I didn't know the atleast_2d() function, but it seems to work. However, I think you meant np.atleast_2d(i)? Concerning the allocation: I do not know the final size; can I still do something to reduce the inefficiency? – Sebastian Apr 20 '11 at 12:15
  • If performance matters, you can first collect all arrays you want to join in a Python list, compute the final size of your array, and then allocate and populate the array. (Just ask a new question if there are any problems with this approach.) – Sven Marnach Apr 20 '11 at 12:25
  • @Sven, re: storing the arrays. This is probably better than my suggestion in most situations (assuming there is sufficient memory to store all the arrays twice). – Henry Gomersall Apr 20 '11 at 12:33
  • @Sven @Henry, thanks very much for your help and tips. If I encounter a speed problem I'll try the python list approach. ciao! – Sebastian Apr 20 '11 at 12:56
  • @Sebastian: Just another thought: maybe a call to `numpy.vstack()` after creating the list as described above is your best bet. – Sven Marnach Apr 20 '11 at 13:11

You can just add all of the arrays to a list, then use np.vstack() to concatenate them all together at the end. This avoids constantly reallocating the growing array with every append.

|1> a = np.arange(27).reshape(3,9)

|2> b = np.arange(9)

|3> np.vstack([a,b])
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
       [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23, 24, 25, 26],
       [ 0,  1,  2,  3,  4,  5,  6,  7,  8]])
Robert Kern
  • Yes, this seems to work; @Sven recommended that as well. The reason I shouldn't call `result = np.vstack((result, i))` in my for loop is inefficiency, right? – Sebastian Apr 21 '11 at 07:48

I'm going to improve my code with the help of @Sven, @Henry and @Robert. @Sven answered the question, so he earns the reputation for this question, but, as highlighted by him and others, there is a more efficient way of doing what I want.

This involves using a Python list, which allows appending in amortized O(1) time per element, whereas growing the result with repeated numpy.append() calls costs O(N**2) overall, since each call copies the entire array. Afterwards, the list is converted to a numpy array:

Suppose each array read from a file is shaped like a or b:

> a = np.arange(27).reshape(3,9)
> b = np.arange(9)
> a.shape
(3, 9)
> b.shape
(9,)

Initialise the list and append all read data, e.g. if the data appear in the order 'aaba':

> mList = []
> for i in [a,a,b,a]:
     mList.append(i)

Your mList will look like this:

> mList
[array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
   [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
   [18, 19, 20, 21, 22, 23, 24, 25, 26]]),
 array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
   [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
   [18, 19, 20, 21, 22, 23, 24, 25, 26]]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
 array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
   [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
   [18, 19, 20, 21, 22, 23, 24, 25, 26]])]

Finally, vstack the list to form a numpy array:

> result = np.vstack(mList)
> result.shape
(10, 9)
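If you want to verify the speed difference yourself, a rough comparison with the timeit module might look like this (a sketch; the sizes and repetition counts are arbitrary):

import numpy as np
import timeit

b = np.arange(9)

def with_append(n):
    # grow the array one row at a time with np.append (copies on every step)
    result = np.empty((0, 9), dtype=b.dtype)
    for _ in range(n):
        result = np.append(result, np.atleast_2d(b), axis=0)
    return result

def with_list(n):
    # collect the rows in a list and stack once at the end
    return np.vstack([b] * n)

print(timeit.timeit(lambda: with_append(1000), number=10))
print(timeit.timeit(lambda: with_list(1000), number=10))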

Thanks again for the valuable help.

Sebastian

As pointed out, append needs to reallocate the numpy array on every call. An alternative solution that allocates only once would be something like this:

total_size = 0
for i in [a, b]:
    total_size += i.size

result = np.empty(total_size, dtype=a.dtype)
offset = 0
for i in [a, b]:
    # copy each array into the preallocated block
    result[offset:offset+i.size] = i.ravel()
    offset += i.size

# if you know it's always divisible by 9:
result = result.reshape(result.size//9, 9)

If you can't precompute the array size, then perhaps you can put an upper bound on it and preallocate a block that will always be big enough. Then you can make the result a view into that block:

result = result[0:known_final_size]
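A minimal sketch of that idea (upper_bound is a placeholder for whatever bound you can derive, and a and b again stand in for the data read from files):

upper_bound = 10**6                      # placeholder: e.g. derived from file sizes
block = np.empty(upper_bound, dtype=a.dtype)
offset = 0
for i in [a, b]:                         # stand-in for the file-reading loop
    block[offset:offset + i.size] = i.ravel()
    offset += i.size
# trim to the filled part; slicing returns a view, not a copy
result = block[:offset].reshape(offset // 9, 9)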
Henry Gomersall
  • Nice snippet, thanks. This, however, would require me to open each of the files (several hundred) twice, right? First to acquire the size (loop 1) and then to extract the data (loop 2). Keeping the file (pointer?) in memory is probably not efficient either (but I'm by no means able to judge that). What do you think? – Sebastian Apr 20 '11 at 12:31
  • I'm not familiar with the exact libraries in question, so forgive me if the answer isn't completely transferable. If the point is that you're loading a file each time, can you get `total_size` from the file metadata? (Certainly, the file size will put an upper bound on the array size, assuming it's not compressed.) Regarding the difference in time, just measure it! (The timeit module makes it very simple.) Incidentally, the easiest way to avoid loading each file multiple times is to do what @Sven suggests in the comments on his post and just load all the arrays into memory. – Henry Gomersall Apr 20 '11 at 12:46