I have a large set of N d-dimensional vectors (stored as rows of a matrix) that I am lifting by taking the self outer product (i.e. each vector times itself). For each vector this produces a symmetric d x d matrix with (d+1) choose 2 unique entries. For the entire data set, this is an N x d x d tensor. I would like to compute only the (d+1) choose 2 unique entries from the lower triangle of each tensor slice and store them in a vector. I want to do this with as small a memory footprint and as fast as possible in Python - including using C bindings.
If you do this using standard numpy methods, the entirety of each d x d matrix is allocated. That is roughly double the memory actually required.
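For concreteness, here is the straightforward einsum version (toy sizes; names are my own) that materializes the full N x d x d tensor, redundant entries included:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 50
X = rng.standard_normal((N, d))

# Straightforward approach: build the full N x d x d tensor of outer products.
# Each slice T[n] equals np.outer(X[n], X[n]) and is symmetric, so nearly half
# of the d * d entries per slice are redundant - yet all of them are allocated.
T = np.einsum('ni,nj->nij', X, X)

print(T.shape)    # (100, 50, 50)
print(T.nbytes)   # 100 * 50 * 50 * 8 = 2_000_000 bytes
```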
For a sense of scale here, consider the case where N = 20k and d = 20k. Then N * d^2 * ~8bytes per element = (2*10^4)^3 * 8 bytes = 64 terabytes.
If we compute only the vectors that encode the unique entries, we need N * ((d+1) choose 2) * 8 bytes = 20000 * 200010000 * 8 bytes ≈ 32 terabytes.
Is there a quick way to do this without resorting to slow methods (such as coding my own outer product in Python)?
Edit: I'll note that a similar question was asked in Create array of outer products in numpy
I already know how to compute this using einsum (as in the above question). However, no answer there addresses how to avoid the extra (d choose 2) computations and allocations per vector.
Edit 2: This thread How to exploit symmetry in outer product in Numpy (or other Python solutions)? asks a related question but does not address the memory complexity. The top answer will still allocate a d x d array for each outer product.
This thread Numpy Performance - Outer Product of a vector with its transpose also addresses the computational cost of a self outer product, but does not reach a memory-efficient solution.
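For reference, the closest I have so far is a row-at-a-time gather (a sketch; the function name and the `tril_rows`/`tril_cols` index arrays are my own). It never forms a d x d matrix per row, but the loop over N is Python-level, which is the slow part I'd like to avoid:

```python
import numpy as np

def lower_triangle_outer(X):
    """For each row v of X, return the d*(d+1)//2 unique entries of
    np.outer(v, v), taken from the lower triangle in row-major order.
    Temporaries are O(d^2 / 2)-sized; no d x d matrix is ever allocated."""
    N, d = X.shape
    tril_rows, tril_cols = np.tril_indices(d)
    out = np.empty((N, tril_rows.size), dtype=X.dtype)
    for n in range(N):
        # elementwise product of the paired coordinates of row n,
        # written directly into the output row
        np.multiply(X[n, tril_rows], X[n, tril_cols], out=out[n])
    return out
```

This only addresses the memory side; the per-row Python overhead remains.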
Edit 3:
If one wants to allocate the whole array and then extract the elements, np.tril_indices or scipy.spatial.distance.squareform will do the trick.
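A toy example of that extract-after-allocating route (note that scipy's squareform drops the diagonal, so it yields (d choose 2) entries rather than (d+1) choose 2; np.tril_indices keeps the diagonal):

```python
import numpy as np

v = np.arange(4.0)
M = np.outer(v, v)        # full 4 x 4 allocation happens first

# lower triangle, diagonal included: (4+1 choose 2) = 10 unique entries
i, j = np.tril_indices(4)
tri = M[i, j]
print(tri)                # [0. 0. 1. 0. 2. 4. 0. 3. 6. 9.]
```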