How to create 2D data set from Gaussian distribution in R?

Question

I need to create "2D data set with 200 samples created from a multivariate Gaussian distribution with a non-diagonal covariance matrix", but I'm neither a statistician nor a mathematician, and I didn't exactly get this.

Here is what I understood. Diagonal matrix is a matrix that has all zeros in the entries outside the main diagonal. Therefore, I assume non-diagonal means a matrix that doesn't have all zeros in the entries outside the main diagonal, such that any random matrix would do, right? So, I started by creating a random matrix, cause it doesn't say any size here, I just did 100x100:

m <- matrix(rnorm(100*100), 100, 100)

I don't know how to achieve the rest. I know the sample() function which creates a sample, but how can I create "2D data set with 200 samples created from a multivariate Gaussian distribution"?

score 3 · Accepted Answer · 2016-06-06T16:23:50.833

As long as you have mean vector and covariance matrix, simulating multivariate normal is very simple, via MASS:::mvrnorm. Have a look at ?mvrnorm for how to use this function.

If you do not have special requirement on the covariance matrix, i.e., a random covariance matrix will do. You need to first create a proper covariance matrix first.

A covariance matrix must be positive-definite. We can create a positive-definite matrix by taking crossproduct of a full-rank matrix. That is, if an n * p (n >= p) matrix X has full column rank, A = X' %*% X is positive-definite (hence a proper covariance).

Let's first generate a random X matrix:

p <- 100  ## we want p-dimensional multivariate normal
set.seed(0); X <- matrix(runif(p * p), p, p)  ## this random matrix has full rank

Then get a covariance matrix:

COV <- crossprod(X)  ## t(X) %*% X but about 2 times faster

We also need mean vector. Let's assume they are 0-mean:

mu <- rep(0, p)

Now we call MASS:::mvrnorm for random sampling:

library(MASS)   ## no need to install
x <- mvrnorm(1000, mu, COV)  ## mvrnorm(sample.size, mean, covariance)

Now x contains 1000 samples from 100-dimension (p-dimensional) multivariate normal distribution, with mean mu and covariance COV.

> str(x)
 num [1:1000, 1:100] 1.66 -2.82 6.62 6.46 -3.35 ...
 - attr(*, "dimnames")=List of 2

x is a matrix, each row of which is a random sample. So in total we have 1000 rows.

For multivariate normal, marginal distribution is still normal. Hence, we can plot histograms for marginals. The following sketches the 1st, 10th, 20th and 30th marginal:

par(mfrow = c(2,2))
hist(x[, 1], main = "1st marginal")
hist(x[, 10], main = "10th marginal")
hist(x[, 20], main = "20th marginal")
hist(x[, 30], main = "30th marginal")

How to create 2D data set from Gaussian distribution in R?

1 Answers1