Generate a data set consisting of N=100 2-dimensional samples

Question

How do I generate a data set consisting of N = 100 2-dimensional samples x = (x1, x2)^T ∈ R^2 drawn from a 2-dimensional Gaussian distribution, with mean

µ = (1, 1)^T

and covariance matrix

Σ = (0.3 0.2 
     0.2 0.2)

I'm told that Matlab has a function, randn, that can be used for this, but I don't know how to do the same thing in Python.

The Python modules `numpy`, `scipy` and `MDP` implement a large amount of the things you can do with Matlab. — Will, Feb 17 '13 at 11:02

unutbu · Accepted Answer · 2013-02-17T20:48:40.517

11

Just to elaborate on @EamonNerbonne's answer: the following uses Cholesky decomposition of the covariance matrix to generate correlated variables from uncorrelated normally distributed random variables.

import numpy as np
import matplotlib.pyplot as plt
linalg = np.linalg

N = 1000
mean = [1,1]
cov = [[0.3, 0.2],[0.2, 0.2]]
# Reference samples drawn directly from the target distribution
data = np.random.multivariate_normal(mean, cov, N)
# Lower-triangular L with L L^T = cov
L = linalg.cholesky(cov)
# print(L.shape)
# (2, 2)
# Independent standard normal variables, one column per sample
uncorrelated = np.random.standard_normal((2,N))
# Correlate the variables with L, then shift by the mean
data2 = np.dot(L,uncorrelated) + np.array(mean).reshape(2,1)
# print(data2.shape)
# (2, 1000)
plt.scatter(data2[0,:], data2[1,:], c='green')
plt.scatter(data[:,0], data[:,1], c='yellow')
plt.show()


The yellow dots were generated by np.random.multivariate_normal. The green dots were generated by multiplying normally distributed points by the Cholesky decomposition matrix L.
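
One way to see why the Cholesky factor works is to check that L times its transpose reproduces the covariance matrix, and that the sample statistics of the transformed points land close to the targets. The following is only a sanity-check sketch: the seed, the sample size `n_check` and the variable names are arbitrary assumptions, not part of the answer above.

```python
import numpy as np

mean = np.array([1.0, 1.0])
cov = np.array([[0.3, 0.2], [0.2, 0.2]])

# Lower-triangular factor; L @ L.T should reproduce cov
L = np.linalg.cholesky(cov)
factor_ok = np.allclose(L @ L.T, cov)

# Fixed seed (arbitrary) so the check is repeatable
rng = np.random.default_rng(0)
n_check = 100_000
z = rng.standard_normal((2, n_check))   # uncorrelated, unit variance
samples = L @ z + mean.reshape(2, 1)    # shape (2, n_check)

sample_mean = samples.mean(axis=1)
sample_cov = np.cov(samples)            # each row is treated as one variable

print(factor_ok)
print(sample_mean)
print(sample_cov)
```

With a sample this large, the printed mean and covariance should sit close to `mean` and `cov`; with only N = 100 samples the estimates will scatter noticeably more.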

edited Feb 17 '13 at 20:48

answered Feb 17 '13 at 13:14

unutbu


Yeah, this is exactly what I had in mind :-) - well worked out example! – Eamon Nerbonne Feb 17 '13 at 14:10
When I try running it, I get this error: L = linalg.cholesky(cov) NameError: name 'linalg' is not defined – pythonnewbie Feb 17 '13 at 20:32
Whoops. I forgot to include `linalg = np.linalg`. (Post has been corrected.) – unutbu Feb 17 '13 at 20:49
Thank you! Would you care to take a look at my next question concerning this data set: http://stackoverflow.com/questions/14922607/estimation-of-max-likelihood-sample-mean-and-sample-covariance – pythonnewbie Feb 17 '13 at 22:09
It's important to have an inkling how these multivariate normal distributions work if you want to program with them in more complex scenarios, but if you're just trying to generate a few numbers, by all means use the prepackaged `numpy.random.multivariate_normal` as ATOzTOA suggests. – Eamon Nerbonne Feb 18 '13 at 15:31
Could you please explain why you used the Cholesky decomposition matrix `L`? What is its significance? I am also working on an experiment where I generate data myself from a Gaussian distribution. – Shyamkkhadka Jan 27 '18 at 13:10

score 5 · Answer 2 · answered Feb 17 '13 at 11:00

You are looking for numpy.random.multivariate_normal

Code

>>> import numpy
>>> print(numpy.random.multivariate_normal([1,1], [[0.3, 0.2],[0.2, 0.2]], 100))
[[ 0.02999043  0.09590078]
 [ 1.35743021  1.08199363]
 [ 1.15721179  0.87750625]
 [ 0.96879114  0.94503228]
 [ 1.23989167  1.13473083]
 [ 1.55917608  0.81530847]
 [ 0.89985651  0.7071519 ]
 [ 0.37494324  0.739433  ]
 [ 1.45121732  1.17168444]
 [ 0.69680785  1.2727178 ]
 [ 0.35600769  0.46569276]
 [ 2.14187488  1.8758589 ]
 [ 1.59276393  1.54971412]
 [ 1.71227009  1.63429704]
 [ 1.05013136  1.1669758 ]
 [ 1.34344004  1.37369725]
 [ 1.82975724  1.49866636]
 [ 0.80553877  1.26753018]
 [ 1.74331784  1.27211784]
 [ 1.23044292  1.18110192]
 [ 1.07675493  1.05940509]
 [ 0.15495771  0.64536509]
 [ 0.77409745  1.0174171 ]
 [ 1.20062726  1.3870498 ]
 [ 0.39619719  0.77919884]
 [ 0.87209168  1.00248145]
 [ 1.32273339  1.54428262]
 [ 2.11848535  1.44338789]
 [ 1.45226461  1.42061198]
 [ 0.33775737  0.24968543]
 [ 1.06982557  0.64674411]
 [ 0.92113229  1.0583153 ]
 [ 0.54987592  0.73198037]
 [ 1.06559727  0.77891362]
 [ 0.84371805  0.72957046]
 [ 1.83614557  1.40582746]
 [ 0.53146009  0.72294094]
 [ 0.98927818  0.73732053]
 [ 1.03984002  0.89426628]
 [ 0.38142362  0.32471126]
 [ 1.44464929  1.15407227]
 [-0.22601279  0.21045592]
 [-0.01995875  0.45051782]
 [ 0.58779449  0.44486237]
 [ 1.31335981  0.92875936]
 [ 0.42200098  0.6942829 ]
 [ 0.10714426  0.11083002]
 [ 1.44997839  1.19052704]
 [ 0.78630506  0.45877582]
 [ 1.63432202  1.95066539]
 [ 0.56680926  0.92203111]
 [ 0.08841491  0.62890576]
 [ 1.4703602   1.4924649 ]
 [ 1.01118864  1.44749407]
 [ 1.19936276  1.02534702]
 [ 0.67893239  0.8482461 ]
 [ 0.71537211  0.53279103]
 [ 1.08031573  1.00779064]
 [ 0.66412568  0.57121041]
 [ 0.96098528  0.72318386]
 [ 0.7690299   0.76058713]
 [ 0.77466896  0.77559282]
 [ 0.47906664  0.58602633]
 [ 0.52481326  0.78486453]
 [-0.40240438  0.17374116]
 [ 0.75730444  0.22365892]
 [ 0.67811008  1.17730408]
 [ 1.62245699  1.71775386]
 [ 1.12317847  1.04252136]
 [-0.06461117  0.23557416]
 [ 0.46299482  0.51585414]
 [ 0.88125676  1.23284201]
 [ 0.57920534  0.63765861]
 [ 0.88239858  1.32092112]
 [ 0.63500551  0.94788141]
 [ 1.76588148  1.63856465]
 [ 0.65026599  0.6899672 ]
 [ 0.06854287  0.29712499]
 [ 0.61575737  0.87526625]
 [ 0.30057552  0.54475194]
 [ 0.66578769  0.21034844]
 [ 0.94670438  0.7699764 ]
 [ 0.39870371  0.91681577]
 [ 1.37531351  1.62337899]
 [ 1.92350877  1.34382017]
 [ 0.56631877  0.77456137]
 [ 1.18702642  0.63700271]
 [ 0.74002244  1.04535471]
 [ 0.3272063   0.75097037]
 [ 1.57583435  1.55809705]
 [ 0.44325124  0.39620769]
 [ 0.59762516  0.58304621]
 [ 0.72253698  0.68302097]
 [ 0.93459597  1.01101948]
 [ 0.50139577  0.52500942]
 [ 0.84696441  0.68679341]
 [ 0.63483432  0.22205385]
 [ 1.43642478  1.34724612]
 [ 1.58663111  1.49941374]
 [ 0.73832806  0.95690866]]
>>>
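
On NumPy 1.17 or later, the same draw can also be written with the `Generator` interface, which makes seeding explicit. This is a sketch rather than a required change; the seed value and the names `rng`, `samples`, `x1` and `x2` are assumptions chosen for illustration.

```python
import numpy as np

mean = [1, 1]
cov = [[0.3, 0.2], [0.2, 0.2]]

# Seed chosen arbitrarily; omit it to get different samples on each run
rng = np.random.default_rng(42)
samples = rng.multivariate_normal(mean, cov, size=100)

print(samples.shape)  # (100, 2): one row per sample

# Split into the two coordinates if they are needed separately
x1, x2 = samples[:, 0], samples[:, 1]
```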

score 3 · Answer 3 · answered Feb 17 '13 at 12:00

Although numpy has handy utility functions, you can always "rescale" multiple independent, normally distributed variables to match your given covariance matrix. If you generate a column vector x (or many vectors grouped in a matrix) in which each element is independently standard normal, and you multiply it by a matrix M, the result has covariance M M^T. Conversely, if you decompose your covariance C into the form M M^T, it's simple to generate such a distribution even without the utility functions numpy provides: just multiply your normally distributed vectors by M.
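
The covariance claim follows in one line, assuming the elements of x are independent with unit variance, so that Cov(x) is the identity matrix I:

```latex
\operatorname{Cov}(Mx) = M \,\operatorname{Cov}(x)\, M^{T} = M I M^{T} = M M^{T}
```

Adding a constant mean vector afterwards shifts the samples without changing this covariance.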

This is perhaps not the answer you're directly looking for, but it's useful to keep in mind, e.g.:

if you ever find yourself scaling the result of the random generation, you could instead fold the scaling into your initial covariance;
if you ever need to port code to a library that doesn't directly support such a utility method, it's very easy to implement yourself.
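
Both points can be sketched in a few lines. The scaling matrix `A` below is hypothetical, as are the seed and sample size. The eigendecomposition is used only to show that any M with M M^T equal to the target covariance works; a Cholesky factor would serve equally well.

```python
import numpy as np

C = np.array([[0.3, 0.2], [0.2, 0.2]])
A = np.array([[2.0, 0.0], [0.0, 0.5]])  # hypothetical post-hoc scaling

# Scaling samples by A gives covariance A C A^T, so fold it in up front
C_scaled = A @ C @ A.T

# Build M with M @ M.T == C_scaled from the eigendecomposition
eigvals, eigvecs = np.linalg.eigh(C_scaled)
M = eigvecs * np.sqrt(eigvals)          # scales column j by sqrt(eigvals[j])

rng = np.random.default_rng(1)          # arbitrary seed
z = rng.standard_normal((2, 50_000))    # independent standard normals
samples = M @ z                         # zero mean, covariance close to C_scaled

print(np.allclose(M @ M.T, C_scaled))
print(np.cov(samples))
```

Only `standard_normal` and basic linear algebra are needed here, which is what makes the approach easy to port to a library without a multivariate normal sampler.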
