
I have to load a dataset into a big array with p instances, where each instance is a 2-D array of shape (n_i, m). The length of the first dimension, n_i, varies between instances.

My first approach was to pad every instance along the first dimension to max_len, initialize an array of shape (p, max_len, m), and then assign each instance into the big array with big_array[i] = padded_i_instance. This is fast and works well; the problem is that I only have 8 GB of RAM, and I get an (interrupted by signal 9: SIGKILL) error when I try to load the whole dataset. It also feels very wasteful, since the shortest instance is almost 10 times shorter than max_len, so some instances are 90% padding.
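For reference, a minimal sketch of this padded approach (the variable names and the small random dataset are illustrative, not from the question):

```python
import numpy as np

# illustrative ragged dataset: 3 instances of shape (n_i, m)
rng = np.random.default_rng(0)
m = 4
instances = [rng.random((n_i, m)) for n_i in (3, 7, 5)]

max_len = max(a.shape[0] for a in instances)
big_array = np.zeros((len(instances), max_len, m))
for i, inst in enumerate(instances):
    # rows beyond n_i are left as zero padding
    big_array[i, :inst.shape[0]] = inst
```

Every instance occupies max_len rows regardless of its real length, which is exactly the memory waste described above.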

My second approach was to build the big_array iteratively with np.vstack. Something like this:

big_array = np.zeros([1, l])  # l = m, the fixed second dimension
for i in range(1, n):
    big_array = np.vstack([big_array, np.full([i, l], i)])

This feels less "wasteful", but it actually takes 100x longer to execute for only 10,000 instances; it is infeasible for 100k+. (Each vstack call copies the entire array built so far, so the total cost grows quadratically with the number of rows.)

So I was wondering whether there is a method that is both more memory-efficient than approach 1 and more computationally efficient than approach 2. I read about np.append and np.insert, but they seem to be other versions of np.vstack, so I assume they would take roughly as much time.

  • you should be able to assign values to the big array without the padding. Alternatively, do `vstack` just once on the whole list of arrays. – hpaulj Dec 09 '21 at 16:02
  • how? The only way the broadcast will succeed for each instance is if they all have the same shape and the initialized array used as "container" also has that shape in the second dimension. Good idea, I will try doing vstack only once to see how much faster it is. – Samuel Rodríguez Dec 09 '21 at 16:18
  • I was thinking of assigning slices in a 2d array: `big_array[i:i+n, :] = small_array` where `small_array` has `n` rows. – hpaulj Dec 09 '21 at 16:41

1 Answer


The slow repeated vstack:

In [200]: n=5; l=2
     ...: big_array = np.zeros([1,l])
     ...: for i in range(1,n):
     ...:     big_array = np.vstack([big_array, np.full([i,l], i)])
     ...: 
In [201]: big_array
Out[201]: 
array([[0., 0.],
       [1., 1.],
       [2., 2.],
       [2., 2.],
       [3., 3.],
       [3., 3.],
       [3., 3.],
       [4., 4.],
       [4., 4.],
       [4., 4.],
       [4., 4.]])

list append is faster:

In [202]: alist = []
In [203]: for i in range(1,n):
     ...:     alist.append(np.full([i,l], i))
     ...: 
     ...: 
In [204]: alist
Out[204]: 
[array([[1, 1]]),
 array([[2, 2],
        [2, 2]]),
 array([[3, 3],
        [3, 3],
        [3, 3]]),
 array([[4, 4],
        [4, 4],
        [4, 4],
        [4, 4]])]
In [205]: np.vstack(alist)
Out[205]: 
array([[1, 1],
       [2, 2],
       [2, 2],
       [3, 3],
       [3, 3],
       [3, 3],
       [4, 4],
       [4, 4],
       [4, 4],
       [4, 4]])

filling a preallocated array:

In [210]: arr = np.zeros((10,2),int)   # 10 = 1+2+3+4 total rows
     ...: cnt=0
     ...: for i in range(1,n):          # the i=0 block is empty, so start at 1
     ...:     arr[cnt:cnt+i,:] = np.full([i,l],i)
     ...:     cnt += i
     ...: 
In [211]: arr
Out[211]: 
array([[1, 1],
       [2, 2],
       [2, 2],
       [3, 3],
       [3, 3],
       [3, 3],
       [4, 4],
       [4, 4],
       [4, 4],
       [4, 4]])
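Since the instances have different lengths, the fill positions can be precomputed from the lengths with a cumulative sum, which also lets you slice any instance back out of the flat array without padding. This offset bookkeeping is an addition beyond the answer's code; the names are illustrative:

```python
import numpy as np

# same ragged data as above: blocks of i rows filled with i, for i = 1..4
n, l = 5, 2
lengths = np.arange(1, n)                            # rows per instance: 1, 2, 3, 4
offsets = np.concatenate([[0], np.cumsum(lengths)])  # 0, 1, 3, 6, 10

arr = np.zeros((offsets[-1], l), int)                # exact total size, no padding
for i in range(1, n):
    arr[offsets[i-1]:offsets[i]] = np.full((lengths[i-1], l), i)

# recover instance j (0-based) without any padding
j = 2
instance_j = arr[offsets[j]:offsets[j+1]]            # the block of 3s, shape (3, 2)
```

The offsets array costs only p+1 extra integers, so the whole structure stays as compact as the flat array itself.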
hpaulj
  • Great, this is what I was looking for; the list append is even faster than the original broadcasting method for a big number of instances, thanks! – Samuel Rodríguez Dec 10 '21 at 11:33