
Currently, I have some code like this:

import numpy as np
ret = np.array([])
for i in range(100000):
    tmp = get_input(i)  # get_input() and fixed_length are defined elsewhere
    ret = np.append(ret, np.zeros(len(tmp)))
    ret = np.append(ret, np.ones(fixed_length))

I think this code is not efficient, as np.append needs to return a copy of the array instead of modifying ret in-place.

I was wondering whether I can use something like extend for a numpy array, like this:

import numpy as np
from somewhere import np_extend  # hypothetical in-place extend
ret = np.array([])
for i in range(100000):
    tmp = get_input(i)
    np_extend(ret, np.zeros(len(tmp)))
    np_extend(ret, np.ones(fixed_length))

That way, the extension would be much more efficient. Does anyone have ideas about this? Thanks!

Hanfei Sun

4 Answers


Imagine a numpy array as occupying one contiguous block of memory. Now imagine other objects, say other numpy arrays, which are occupying the memory just to the left and right of our numpy array. There would be no room to append to or extend our numpy array. The underlying data in a numpy array always occupies a contiguous block of memory.

So any request to append to or extend our numpy array can only be satisfied by allocating a whole new larger block of memory, copying the old data into the new block and then appending or extending.

So:

  1. It will not occur in-place.
  2. It will not be efficient.
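A minimal sketch illustrating the copy (the array and values here are chosen just for demonstration):

import numpy as np

a = np.zeros(3)
old_address = a.ctypes.data          # address of a's underlying buffer
b = np.append(a, [1.0, 2.0])         # allocates a new, larger buffer and copies into it

print(a.ctypes.data == old_address)  # True:  a itself is left untouched
print(b.ctypes.data == old_address)  # False: b lives in a freshly allocated block
print(len(a), len(b))                # 3 5
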
unutbu
  • How about combining `linked list` with `block of memory` to provide the `extend` function? – Hanfei Sun Nov 04 '12 at 02:48
  • That's fine, but you'd then have to re-implement all the numpy methods on your linked list, while transparently hiding the fact that the underlying numpy arrays are not contiguous. Moreover, some numpy functions call functions written in C or Fortran (such as LAPACK) which take advantage of the fact that the input is a contiguous block of memory. To send your linked non-contiguous data to these functions, you'd have to allocate and copy, so again it would be inefficient. – unutbu Nov 04 '12 at 02:54
  • However, if there are no other objects in the way of enlarging a block of memory, it could be efficient. I would say your answer is not complete, as the assumption that other memory allocations always block an in-place resize is not correct. In C, for example, you have `realloc`. What I would expect from a good answer is to state whether `numpy` also has a `realloc` equivalent, and if not, why not. I have also often wondered why, with the tons of address space modern 64-bit machines have, smart paging methods are not used more to leave virtual space between allocations for this purpose. – Herbert Apr 07 '16 at 12:27
  • @Herbert Numpy's `.resize()` does use `realloc()` internally, but Numpy follows the Python convention of automatic memory management, so there's no sense in directly exposing something like `realloc()`. In general, `ndarray` is tailored to large arrays with static sizes rather than dynamic arrays. Resize performance is not a major goal. A `malloc` implementation used by Numpy can use intricate allocation schemes like you suggest, but that's independent of Numpy. – Praxeolitic Apr 10 '16 at 11:51
  • @Praxeolitic Your statements are valid, but do not contradict mine: @unutbu states that a `resize` will never occur in-place and is never efficient. That is incorrect. If `realloc` is used, it *might* occur in-place and it *might* be efficient. It depends on the memory management implementation. Therefore, from a Python perspective there is no way of making the claims @unutbu did. My point about the suggested allocation scheme was that the need for a trade-off between efficiency and dynamic size has always seemed artificial to me, even though it might not be a problem for `numpy` to tackle. – Herbert Apr 11 '16 at 07:27
  • @unutbu If I understand you correctly, the challenge you describe is well known: it is the problem of implementing dynamic arrays, and it is largely solved. Python's own `list` is one such implementation, achieving amortized constant-time appends. The question is why NumPy doesn't provide such an implementation. – flow2k Nov 30 '20 at 06:00

You can use the .resize() method of ndarrays. It requires that the memory is not referred to by other arrays/variables.

import numpy as np
ret = np.array([])
for i in range(100):
    tmp = np.random.rand(np.random.randint(1, 100))
    ret.resize(len(ret) + len(tmp)) # <- ret is not referred to by anything else,
                                    #    so this works
    ret[-len(tmp):] = tmp

Efficiency can be improved by using the usual array memory over-allocation schemes.
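A minimal sketch of one such over-allocation scheme (the helper name, the growth factor, and the use of refcheck=False are choices made here for illustration, not part of the answer above):

import numpy as np

def append_with_growth(buf, used, chunk, growth=1.5):
    # Valid data lives in buf[:used]; grow the buffer geometrically so that
    # each element is copied only O(1) times on average over many appends.
    needed = used + len(chunk)
    if needed > len(buf):
        new_size = max(needed, int(len(buf) * growth) + 1)
        buf.resize(new_size, refcheck=False)  # may reallocate; no other views exist here
    buf[used:needed] = chunk
    return needed

buf = np.empty(16)
used = 0
for i in range(100):
    tmp = np.random.rand(np.random.randint(1, 100))
    used = append_with_growth(buf, used, tmp)
ret = buf[:used]  # view of just the valid data
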

pv.

The usual way to handle this is something like the following:

import numpy as np
ret = []
for i in range(100000):
    tmp = get_input(i)
    ret.append(np.zeros(len(tmp)))
    ret.append(np.ones(fixed_length))
ret = np.concatenate(ret)

For reasons that other answers have gotten into, it is in general impossible to extend an array without copying the data.

Bi Rico

I came across this question while researching in-place numpy insertion methods.

While reading the answers given here, an alternative occurred to me (maybe a naive one, but still an idea): why not convert the numpy array back to a list, append whatever you want to it, and then convert it back to an array?

In case you have too many insertions to be done one at a time, you could create a kind of "list cache" where you would put all the insertions and then apply them to the list in one step.

Of course, if one is trying to avoid at all costs a conversion to a list and back to a numpy array, this is not an option.
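
A minimal sketch of the round trip and the "list cache" idea described above (the data is made up for illustration):

import numpy as np

ret = np.array([1.0, 2.0, 3.0])

# round trip: array -> list -> extend -> array
tmp_list = ret.tolist()
tmp_list.extend([4.0, 5.0])
ret = np.array(tmp_list)

# "list cache": collect many insertions first, then apply them in one step
cache = []
for i in range(1000):
    cache.extend(np.random.rand(3).tolist())
tmp_list = ret.tolist()
tmp_list.extend(cache)
ret = np.array(tmp_list)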

Gustavo Mirapalheta
  • This would technically work, but I think it kind of misses the point - you'd be writing nearly the same number of steps, without achieving greater efficiency. – syoels Jun 08 '20 at 07:57