
I have a dataset of 60,000 examples of the form:

      mu1     mu2    std1    std2
0   -0.745   0.729  0.0127  0.0149
1   -0.711   0.332  0.1240  0.0433
...

They are essentially the parameters of 2-dimensional normal distributions. What I want to do is create an (N x N) matrix P such that P_ij = Normal(mu_i | mean=mu_j, cov=diag(std_j)), where mu_i is the vector (mu1, mu2) of row 'i'.

I can do this with the following code for example:

from scipy import stats
import numpy as np

# `data` is the pandas DataFrame shown above
mu_all = data[['mu1', 'mu2']].to_numpy()     # shape (N, 2)
std_all = data[['std1', 'std2']].to_numpy()  # shape (N, 2)

P = []
for i in range(len(data)):
    mu_i = mu_all[i, :]
    std_i = std_all[i, :]
    # pdf of every row of mu_all under Normal(mean=mu_i, cov=diag(std_i))
    prob_i = stats.multivariate_normal.pdf(mu_all, mean=mu_i, cov=np.diag(std_i))
    P.append(prob_i)

P = np.array(P).T

But this is too expensive (my machine freezes). How can I do this more efficiently? My guess is that scipy cannot handle evaluating the pdf at 60,000 points at once. Is there an alternative?
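
One direction I have been considering (a minimal sketch, not tested at the full 60,000 scale): because each column's covariance is diagonal, the pdf has a simple closed form, so the per-row scipy calls can be replaced by NumPy broadcasting over blocks of columns. This assumes the std1/std2 columns are used directly as the covariance diagonal, exactly as in the formula above; the block size of 1000 is arbitrary, and the full N x N output still has to be stored somewhere, which may be the real bottleneck:

import numpy as np

mu = data[['mu1', 'mu2']].to_numpy()      # shape (N, 2)
var = data[['std1', 'std2']].to_numpy()   # shape (N, 2), used as the covariance diagonal

N = len(mu)
block = 1000                              # number of columns (j) processed per block

# normalisation constant of column j's Gaussian: 1 / (2*pi*sqrt(v1*v2))
norm = 1.0 / (2.0 * np.pi * np.sqrt(var[:, 0] * var[:, 1]))

# float32 halves the memory, but 60,000 x 60,000 is still roughly 14 GB
P = np.empty((N, N), dtype=np.float32)

for start in range(0, N, block):
    stop = min(start + block, N)
    # diff[i, j, :] = mu_i - mu_j for the columns in this block -> shape (N, B, 2)
    diff = mu[:, None, :] - mu[None, start:stop, :]
    # quadratic form (x - m)^T diag(v)^-1 (x - m) for a diagonal covariance
    quad = np.sum(diff ** 2 / var[None, start:stop, :], axis=-1)
    P[:, start:stop] = norm[None, start:stop] * np.exp(-0.5 * quad)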

  • scipy can compute such a pdf at 60000 points in ~26 ms in my computer, doing it 60000 times would take ~26 min. Unfortunately my computer seems unable to allocate a 60000 x 60000 array of float64. – Stop harming Monica Nov 03 '17 at 18:53
  • Yes, sorry, that wasn't the problem; the large matrix is the problem. Thanks – Babak Nov 04 '17 at 17:23

1 Answer


Just realized that a matrix of that size (60,000 x 60,000) cannot be allocated in memory on a typical machine; a 60,000 x 60,000 float64 array needs roughly 28.8 GB of RAM. This limitation is discussed here:

Very large matrices using Python and NumPy

So I don't think this can be done as a single in-memory array.
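
For reference, the required memory is easy to check with a bit of arithmetic (these are computed sizes, not measurements):

n = 60_000
print(n * n * 8 / 1e9, "GB as float64")  # ~28.8 GB
print(n * n * 4 / 1e9, "GB as float32")  # ~14.4 GB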
