0

I am trying to solve a statistics-related real world problem with Python and am looking for inputs on my ideas: I have N random vectors from a m-dimensional normal distribution. I have no information about the means and the covariance matrix of the underlying distribution, in fact also that it is a normal distribution is only an assumption, a very plausible one though. I want to compute an approximation of the mean vector and covariance matrix of the distribution. The number of random vectors is in the order of magnitude of 100 to 300, the dimensionality of the normal distribution is somewhere between 2 and 5. The time for the calculation should ideally not exceed 1 minute on a standard home computer.

I am currently thinking about three approaches and am happy about all suggestions for other approaches or preferences between those three:

  1. Fitting: Make a multi dimensional histogram of all random vectors and fit a multi dimensional normal distribution to the histogram. Problem about that approach: The covariance matrix has many entries, this could possibly be a problem for the fitting process?

  2. Invert cumulative distribution function: Make a multi dimensional histogram as approximation of the density function of the random vectors. Then integrate this to get a multi dimensional cumulative distribution function. For one dimension, this is invertible and one could use the cum-dist function to distribute random numbers like in the original distribution. Problem: For the multi-dimensional case the cum-dist function is not invertible(?) and I don't know if this approach still works then?

  3. Bayesian: Use Bayesian Statistics with some normal distribution as prior and update for every observation. The result should always be again a normal distribution. Problem: I think this is computationally expensive? Also, I don't want the later updates have more impact on the resulting distribution than the earlier ones.

Also, maybe there is some library which has this task already implemented? I did not find exactly this in Numpy or Scipy, maybe someone has an idea where else to look?

FeWa
  • 141
  • 5
  • Have you tried the expectation-maximization algorithm? It is used to estimate parameters of a density function – politinsa Feb 23 '20 at 16:39
  • @politinsa That sounds like a very useful tool for my problem! Have you every applied this yourself in python or do you have suggestions for useful libraries that have implementations of that algorithm? – FeWa Feb 23 '20 at 17:09
  • Are the simple estimates described in the section [Parameter estimation](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Parameter_estimation) of the wikipedia article on the multivariate normal distribution sufficient for your needs? If so, you can use [`numpy.cov`](https://numpy.org/devdocs/reference/generated/numpy.cov.html) to compute the sample covariance and `numpy.mean` to compute the mean. – Warren Weckesser Feb 23 '20 at 19:23
  • @WarrenWeckesser Yes a straight forward ML estimation really seems to be the best way here, especially because I expect the distribution to be a normal distribution anyway. Thank you for your help, if you post this as an answer, I will accept it. – FeWa Feb 23 '20 at 21:33

1 Answers1

1

If the simple estimates described in the section Parameter estimation of the wikipedia article on the multivariate normal distribution are sufficient for your needs, you can use numpy.mean to compute the mean and numpy.cov to compute the sample covariance matrix.

Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214