Calculate mean, variance, covariance of different length matrices in a split list

Question

I have an array of 5 values, consisting of 4 values and one index. I sort and split the array along the index. This leads me to splits of matrices with different lengths. From here on I want to calculate the mean, variance of the fourth values and covariance of the first 3 values for every split. My current approach works with a for loop, which I would like to replace by matrix operations, but I am struggeling with the different sizes of my matrices.

import numpy as np
A = np.random.rand(10,5) 
A[:,-1] = np.random.randint(4, size=10)
sorted_A = A[np.argsort(A[:,4])]
splits = np.split(sorted_A, np.where(np.diff(sorted_A[:,4]))[0]+1)

My current for loop looks like this:

result = np.zeros((len(splits), 5))
for idx, values in enumerate(splits):
    if(len(values))>0:
        result[idx, 0] = np.mean(values[:,3])
        result[idx, 1] = np.var(values[:,3])
        result[idx, 2:5] = np.cov(values[:,0:3].transpose(), ddof=0).diagonal()
    else:
        result[idx, 0] = values[:,3]

I tried to work with masked arrays without success, since I couldn't load the matrices into the masked arrays in a proper form. Maybe someone knows how to do this or has a different suggestion.

Padding the short arrays with `nan` may help. There are some 'no nan' functions. — hpaulj, Apr 20 '18 at 06:32

score 2 · Accepted Answer · answered Apr 20 '18 at 06:33

You can use np.add.reduceat as follows:

>>> idx = np.concatenate([[0], np.where(np.diff(sorted_A[:,4]))[0]+1, [A.shape[0]]])
>>> result2 = np.empty((idx.size-1, 5))
>>> result2[:, 0] = np.add.reduceat(sorted_A[:, 3], idx[:-1]) / np.diff(idx)
>>> result2[:, 1] = np.add.reduceat(sorted_A[:, 3]**2, idx[:-1]) / np.diff(idx) - result2[:, 0]**2
>>> result2[:, 2:5] = np.add.reduceat(sorted_A[:, :3]**2, idx[:-1], axis=0) / np.diff(idx)[:, None]
>>> result2[:, 2:5] -= (np.add.reduceat(sorted_A[:, :3], idx[:-1], axis=0) / np.diff(idx)[:, None])**2
>>> 
>>> np.allclose(result, result2)
True

Note that the diagonal of the covariance matrix are just the variances which simplifies this vectorization quite a bit.

Calculate mean, variance, covariance of different length matrices in a split list

1 Answers1