I have to load a dataset into a big array of p instances, where each instance is a 2-D array of shape (n_i, m). The length n_i of the first dimension varies from instance to instance.
My first approach was to pad all the instances to max_len along the first dimension, initialize an array of shape (p, max_len, m), and then copy each padded instance into its slot with big_array[i] = padded_instance.
This is fast and works well; the problem is that I only have 8 GB of RAM, and I get killed with an (interrupted by signal 9: SIGKILL) error when I try to load the whole dataset. It also feels very wasteful, since the shortest instance is almost 10 times shorter than max_len, so some instances are 90% padding.
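For concreteness, here is a minimal sketch of approach 1. The dataset is faked with random arrays, and names like instances and m are just placeholders, not my real loader:

```python
import numpy as np

# Stand-in for the real dataset: a list of (n_i, m) arrays with variable n_i.
rng = np.random.default_rng(0)
m = 4
instances = [rng.normal(size=(rng.integers(5, 50), m)) for _ in range(1000)]

p = len(instances)
max_len = max(inst.shape[0] for inst in instances)

# Approach 1: one big zero-padded array of shape (p, max_len, m).
big_array = np.zeros((p, max_len, m), dtype=instances[0].dtype)
for i, inst in enumerate(instances):
    big_array[i, :inst.shape[0]] = inst  # rows beyond n_i stay as zero padding
```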
My second approach was to use np.vstack and build big_array iteratively. Something like this:
```python
import numpy as np

# l plays the role of m and n the number of instances; both are placeholders.
big_array = np.zeros([1, l])
for i in range(1, n):
    big_array = np.vstack([big_array, np.full([i, l], i)])  # copies the whole array every iteration
```
This feels less "wasteful", but it actually takes about 100x longer to execute for only 10,000 instances, so it is infeasible to use for 100k+.
So I was wondering whether there is a method that is both more memory-efficient than approach 1 and more computationally efficient than approach 2. I read about np.append and np.insert, but they seem to be just other flavors of np.vstack, so I assume they would take roughly as much time.
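To illustrate why I assume that (just a sanity check, not part of my loader): both np.append with axis=0 and np.vstack allocate and fill a brand-new array on every call, so growing the array one instance at a time copies everything accumulated so far on each iteration.

```python
import numpy as np

a = np.zeros((3, 4))
row = np.ones((1, 4))

b = np.append(a, row, axis=0)  # returns a freshly allocated (4, 4) array
c = np.vstack([a, row])        # same: a new (4, 4) array, nothing in-place

# Neither result shares memory with the original, so each call is a full copy.
print(np.shares_memory(a, b), np.shares_memory(a, c))  # False False
```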