1

EDIT: I've gotten a lot of useful feedback on how not to do this and how to find alternatives, but making that feedback useful depends on idiosyncrasies of my use case that would make this question less useful to others. At this point I'm not looking for alternatives to data structured like this; I'm looking for why this seems to be impossible in numpy (or how to do it, if it's not impossible).

I have a numpy array, which looks like

a = array([list([1]), list([4, 5])], dtype=object)

I want to append a list like

b = [2, 3, 4]

To get a result like

array([list([1]), list([4, 5]), list([2, 3, 4])], dtype=object)

However, every method I've tried has produced:

array([list([1]), list([4, 5]), 2, 3, 4], dtype=object)

I've tried vstack, concatenate, and append, as well as wrapping things in lists or ndarrays.

Why am I doing this? Basically, I have a lot of data in an ndarray that's going to get fed into sklearn. I want a 3d ndarray (data sets x data points x features), but the incoming data is bad and certain things have different lengths, so the innermost dimension has to be lists. I'm trying to append a derived feature, which keeps failing. I've managed to reorder the operations to avoid needing this append, but I still want to know how to do it; this seems like an odd failure for numpy.

edit: In short, the outer array must be an ndarray because it's actually 2d and complex slicing is used frequently, while the append operation occurs only a few times.
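For anyone trying to reproduce this, a minimal sketch of the failure (the array is built via `np.empty` here, since constructing ragged object arrays with `np.array` has its own pitfalls):

```python
import numpy as np

# Build the example object array element by element.
a = np.empty(2, dtype=object)
a[0] = [1]
a[1] = [4, 5]

b = [2, 3, 4]

# Both of these flatten b into individual elements
# instead of appending it as a single list:
print(np.append(a, b))         # 5 elements, not 3
print(np.concatenate((a, b)))  # same flattening
```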

Owen Gray
  • 27
  • 3
  • I can do it by a.append("temp"); a[-1]=b, but that seems hacky and bad. It feels like there should be a way to do this. – Owen Gray Jul 06 '18 at 13:11
  • In a one liner, you could do: `arr = np.array(a.tolist()+[b])` – Brenlla Jul 06 '18 at 15:16
  • @Brenlla That would work, even with the outer ndarray being 2d. However, I'm a bit concerned about the speed of the double conversion, given that these arrays actually have thousands of elements. It also seems kind of ugly and unfortunate. I was really hoping for a numpy method to do this, or an explanation of why the builtins fail. I'd definitely upvote this as an answer though. – Owen Gray Jul 06 '18 at 15:27
  • There is another *hacky* way: `np.concatenate((a,np.array((b,[]))))[:-1]` – Brenlla Jul 06 '18 at 15:37
  • @Brenlla I think this is actually much better cost-wise, as I think this just constructs one tiny ndarray, then performs the concatenation, and then returns a view. Still hacky though. – Owen Gray Jul 06 '18 at 15:41
  • 3
    @OwenGray. Appending to an array is a bad idea. Don't worry about hackishness and cost. – Mad Physicist Jul 06 '18 at 15:41
  • 1
    Agree with @MadPhysicist. If you're concerned about performance, don't use NumPy arrays for mismatched-length lists. If you *must*, use padding and create a normal `m x n` array. – jpp Jul 06 '18 at 15:43
  • @MadPhysicist I think it's kind of unavoidable here though—even in my reordering that avoids this, I still end up appending. Basically, I have an array of (features x data points) and then add one new feature for each data point computed from existing features. Not sure how to do that without appending. – Owen Gray Jul 06 '18 at 15:44
  • @jpp is it really that bad? I thought object ndarrays just stored references, and my only accesses to it involve iterating through (first to compute the thing I'm appending, and then later to cut all sublists to the same length and convert to a 3d array), which should be just as efficient as using a list and lets me use numpy multidimensional indexing. – Owen Gray Jul 06 '18 at 15:47
  • `ndarrays just stored references`. You've just described `list`. It's exactly what NumPy arrays don't do with non-`object` dtype. Read [Why NumPy instead of Python lists?](https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists) – jpp Jul 06 '18 at 15:48
  • @jpp yes, but the outer array (which above is unidimensional for a simple example) is multidimensional in my use case, and if I want to slice it by the second index the outer array needs to be an ndarray. I'm aware of why ndarrays of _primitives_ are more efficient than lists, but I don't think ndarrays of _objects_ are _less_ efficient than lists, and they also have better slicing. – Owen Gray Jul 06 '18 at 15:58
  • @Owen. I've proposed a hybrid approach to what jpp is suggesting. On mobile, so everything is slow – Mad Physicist Jul 06 '18 at 15:59
  • You're best off just preallocating a full array and using nans or something though – Mad Physicist Jul 06 '18 at 16:02
  • 1
    iterating on a list is faster than iterating on an object array. list append is also faster. Can sklearn use an object array? – hpaulj Jul 06 '18 at 16:18
  • I think it would be better to just have a list of arrays (rather than an array of lists). At least this way you would be storing references to *contiguous* chunks of memory – Brenlla Jul 06 '18 at 16:19
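The padding approach suggested in the comments above could look something like this (a sketch, assuming numeric features and NaN as the fill value):

```python
import numpy as np

rows = [[1], [4, 5], [2, 3, 4]]

# Pad every row to the length of the longest one with NaN,
# producing a regular m x n float array instead of an object array.
width = max(len(r) for r in rows)
padded = np.full((len(rows), width), np.nan)
for i, r in enumerate(rows):
    padded[i, :len(r)] = r
```

This trades memory for a real 2-D array that supports normal numpy slicing and vectorized operations.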

3 Answers

1

Appending to an array in the first place is an expensive and generally smelly operation. The thing is that the contents of the array may be mutable, but the address of the underlying buffer is not. Every time you append an element, the whole thing gets reallocated and copied. As far as I'm aware, there isn't even an attempt at amortization, as there is with `list`.
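To illustrate (a small sketch): `np.append` always returns a freshly allocated copy, so each append is O(n):

```python
import numpy as np

arr = np.arange(3)
out = np.append(arr, 4)

# np.append never works in place: it allocates a new buffer
# and copies everything on every call.
assert out is not arr
assert arr.size == 3 and out.size == 4
```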

If you are up for a slightly different approach, I would recommend maintaining your data in a list as you have now. You just transform your list into an array whenever you actually need the array. Remember that this is cheaper than reallocating to a new array every time, and you probably won't have to do it often compared to the number of appends:

stack = [[1], [4, 5]]
a = np.array(stack, dtype=object)
# do stuff to the array

...

stack.append([2, 3, 4])
a = np.array(stack, dtype=object)

Update Now that I Understand Your Question

If your goal is just to figure out how to append an element to an object array without having the fact that it is a list get in your way, you first have to create an empty array or element. Rather than trying to coerce the type with fake elements as some of the comments suggest, I recommend creating an empty element and setting it to your list explicitly. You can wrap the operation in a function if you want a clean interface.

Here is an example:

b = [2, 3, 4]
c = np.empty(1, dtype=object)
c[0] = b
a = np.concatenate((a, c))

OR

a = np.append(a, c)

Of course this is not as clean as `np.array([b], dtype=object)` would be, but that's just an artifact of how numpy processes arrays. The reason you pretty much have to do it this way is that numpy treats anything that is a list or tuple as a nested sequence to be converted into an array dimension, rather than as a single element.
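Wrapped in a helper function as suggested above (the name `append_object` is just for illustration), that looks like:

```python
import numpy as np

def append_object(arr, item):
    # Put the item into a one-element object array first,
    # so that concatenate cannot unpack it into its elements.
    c = np.empty(1, dtype=object)
    c[0] = item
    return np.concatenate((arr, c))

a = np.empty(2, dtype=object)
a[0] = [1]
a[1] = [4, 5]
a = append_object(a, [2, 3, 4])  # 3 elements; the last is the whole list
```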

Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
  • This could work for many cases, but in my situation the outer ndarray is 2d, and the overall data structure is 3d. I'm also only appending once, and slicing by the second index many more times. I don't think this approach would work for me. I'm also concerned about the efficiency of this approach: I think list -> ndarray conversion is at least O(n). I've also edited my question to clarify that I'm more trying to make the question useful for others than trying to find the best way to handle my situation. – Owen Gray Jul 06 '18 at 16:14
  • @OwenGray. I hope my update answers your actual question. – Mad Physicist Jul 06 '18 at 18:25
  • 1
    The update here answers the 'why cant I'. `concatenate` joins arrays. The new piece has to be in an array of the right dtype and shape - object and (1,). Starting with `np.empty` is best if not only way to do this. – hpaulj Jul 06 '18 at 19:50
  • @hpaulj. Well, you could technically use `np.zeros` or `np.ones`, but that would just be wasteful. – Mad Physicist Jul 06 '18 at 20:04
  • Object empty fills with `None` so isn't as cheap as the numeric version. But I like that fill. – hpaulj Jul 06 '18 at 20:15
0

If you really must have an np.ndarray with dtype=object, you can do this:

a = np.array([list([1]), list([4, 5])], dtype=object)
b = [2, 3, 4]
a = np.hstack((a, np.empty(1, dtype=object)))
a[-1] = b

(Or of course drop the `np.` prefix if you imported everything from numpy directly.)
But I recommend not using np.ndarrays of dtype=object. Instead use lists with:

a = [list([1]), list([4, 5])]
b = [2, 3, 4]
a.append(b)

Now if you really want to have a as an np.ndarray, you can then do the following:

a = np.array(a, dtype=object)
JE_Muc
  • 5,403
  • 2
  • 26
  • 41
  • 2
    Why the down votes? This works as well as @Mad's update. – hpaulj Jul 06 '18 at 19:55
  • As well as things like unaccepting a correct answer after some time (days or weeks), because some follow up problems show up, which are not described in the initial question and which are not directly connected to the initial problem. Like [here](https://stackoverflow.com/questions/50328737/access-column-with-in-another-column-header/50329298#50329298). – JE_Muc Jul 09 '18 at 11:41
  • Or not accepting the correct answer, like [here](https://stackoverflow.com/questions/51095427/numba-jit-doesnt-allow-the-use-of-np-argsort-in-nopython-mode/51097713#51097713) and [here](https://stackoverflow.com/questions/51210686/attributeerror-list-object-has-no-attribute-t/51210739#51210739). This quite much kills my motivation to help people on SO. :( – JE_Muc Jul 09 '18 at 11:41
  • 2
    I would guess the downvotes are because the first solution you gave was already in a comment I left on the question (before you answered), and the other option you suggested is pretty obvious. – Owen Gray Jul 09 '18 at 19:48
  • 1
    No, that is not true. You do realize that the solution you provided in your first example raises `AttributeError: 'numpy.ndarray' object has no attribute 'append'`? Thus the solution in your comment does not work for the case you provided. And thus I added my first solution **which avoids the AttributeError**. Furthermore imho it is not ok to downvote a solution, just because it is obvious. It is a solution, it works and it was not in the minimum working example. If downvoting obvious solutions was the way to go on SO, 99% of the answers would be downvoted. – JE_Muc Jul 10 '18 at 08:41
  • 1
    Besides that Mad Physicist **did not get a downvote**, even though it was **also featuring the "obvious" solution**, and even though he posted his solution after mine. Imho both should be upvoted, because they **both solve the question**. Downvoting only one of them (and especially the one which solved it in the first place, while the second answer is more or less a copy of it) seems kind of strange to me. Would you like it if someone downvoted your answer, even though it solves the question? I am really sorry if I do not understand your motivation, but I will willingly listen to your opinion. – JE_Muc Jul 10 '18 at 08:54
0

Time has passed, but maybe someone will make use of this (Python 3.9, NumPy 1.23).

I've had the same problem. The easiest solution I've found is to append one element to an ndarray (as a placeholder, in other words as an array extender), then assign the list to the last element of the extended array.

a_list = [1, 2, 3]
an_array = np.ones(10, dtype=object)
an_array = np.append(an_array, 0)
an_array[-1] = a_list

I think the performance impact is small because no temporary one-element object array has to be constructed by hand (though note that `np.append` itself still copies the array, since it calls `np.concatenate` internally).

EDIT: I saw that JE_Muc's solution is almost the same as mine.

DeZee
  • 95
  • 6