
How to generate exact data?

In R we can use rnorm to sample from a population with certain characteristics (say, mean=0, sd=1), but how do we get a sample whose mean and sd are exactly 0 and 1?

This is a simple example. I would also be interested in more general ways to get exact data (for instance, multivariate data with an exact correlation of 0.2).

PascalVKooten
  • You want the sample mean to be exactly equal to the population mean? I doubt that's possible. – Nishanth Apr 20 '13 at 16:23
  • Are you asking how to do this in R or Python? Or either? You could always just force your generated data to have the mean and standard deviation of interest using a linear transformation. – Dason Apr 20 '13 at 16:30
  • Also are you just interested in doing this for the Gaussian case or do you want to do it for any distribution in general? – Dason Apr 20 '13 at 16:31
  • @Dason Any distribution, in either language. – PascalVKooten Apr 20 '13 at 22:11

3 Answers


Simply scale your results. In the univariate case:

set.seed(21)
x <- rnorm(1000)
mean(x)            # not exactly 0
sd(x)              # not exactly 1
y <- x - mean(x)   # center: mean is now exactly 0
y <- y / sd(x)     # scale: sd is unchanged by centering, so this makes it exactly 1
mean(y)            # within floating point precision of 0
sd(y)              # 1

The multivariate case is a bit more involved, but possible.
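
For example, a minimal sketch of the multivariate version (whiten with the Cholesky factor of the sample covariance, then re-colour with the Cholesky factor of a target covariance; the 2-variable target matrix here is just an illustration, not from the question):

set.seed(21)
target <- matrix(c(1, 0.2, 0.2, 1), 2, 2)   # desired covariance (example values)
z <- matrix(rnorm(2000), ncol = 2)
z <- sweep(z, 2, colMeans(z))               # exact zero column means
z <- z %*% solve(chol(cov(z))) %*% chol(target)
cov(z)                                      # equals target, up to floating point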

Joshua Ulrich

It sounds like you want mvrnorm in the MASS package.

library(MASS)

sigma <- matrix(c(1.0, 0.0, -0.5,
                  0.0, 1.0,  0.5,
                 -0.5, 0.5,  1.0), 3, byrow = TRUE)
mat <- mvrnorm(10, c(0, 0, 0), sigma, empirical = TRUE)
cor(mat)
cor(mat)
#     [,1]  [,2]  [,3]
#[1,]  1.0   0.0  -0.5
#[2,]  0.0   1.0   0.5
#[3,] -0.5   0.5   1.0

Note that by setting the SD of each variable to 1 I simplified things, because the covariance then equals the correlation. You can generalize this by remembering that the correlation is the covariance divided by the product of the SDs.
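
For instance (a small sketch with made-up SDs, not taken from the question), you can turn the correlation matrix into a covariance matrix with diag(sds) %*% R %*% diag(sds) and still pass it to mvrnorm with empirical = TRUE:

library(MASS)

R <- matrix(c(1.0, 0.0, -0.5,
              0.0, 1.0,  0.5,
             -0.5, 0.5,  1.0), 3, byrow = TRUE)
sds <- c(2, 1, 0.5)                      # example SDs
sigma <- diag(sds) %*% R %*% diag(sds)   # covariance = correlation scaled by the SDs
mat <- mvrnorm(10, rep(0, 3), sigma, empirical = TRUE)
cor(mat)            # exactly R, up to machine accuracy
apply(mat, 2, sd)   # exactly 2, 1, 0.5, up to machine accuracy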

(Note that when you run the code you may not get exactly these values, but values within machine accuracy, which is all we can hope for.)

John

You can simply rescale the data.

n <- 100
x <- rnorm(n)
x <- ( x - mean(x) ) / sd(x)
mean(x)   # 0, up to machine precision
sd(x)     # 1

You could also use ppoints to start from a regular grid of probabilities and take the corresponding normal quantiles (you still have to rescale, though).

x <- qnorm( ppoints(n) )
x <- ( x - mean(x) ) / sd(x)
mean(x)   # 0, up to machine precision
sd(x)     # 1

In higher dimensions, the transformation is a bit trickier. If x is a Gaussian vector with mean zero and the identity matrix as its variance, then C %*% x is Gaussian with zero mean and variance matrix V = C C'. C is the Cholesky factor of V; it can be seen as an analogue of the square root for (symmetric, positive semi-definite) matrices. (In the code below the observations are stored in rows and R's chol returns the upper-triangular factor, which is why the multiplication appears as x %*% C.)

Two such transformations are needed: the first sets the sample variance to the identity, the second sets it to the desired value.

# Desired variance matrix
V <- matrix( c(1,.2,.2, .2,1,.2, .2,.2,1), 3, 3 )

# Random data
n <- 100
k <- 3
x <- matrix( rnorm(k*n), ncol = k )

# Set the mean to 0, and the variance to the identity
x <- t( t(x) - colMeans(x) )
colMeans(x)   # 0
C1 <- chol(var(x))
x <- x %*% solve(C1)
var(x)   # identity matrix

# Set the variance to the desired value
C2 <- chol(V)
x <- x %*% C2
var(x) - V   # zero
Vincent Zoonekynd