
How to generate exact data?

In R we can use rnorm to sample from a population with certain characteristics (say, mean=0, sd=1), but how do we get a sample whose mean and sd are exactly 0 and 1?

This is a simple example. I would also be interested in more general ways to get exact data (for instance, multivariate data with an exact correlation of 0.2).

PascalVKooten
  • You want the sample mean to be exactly equal to the population mean? I doubt that's possible. – Nishanth Apr 20 '13 at 16:23
  • Are you asking how to do this in R or Python? Or either? You could always just force your generated data to have the mean and standard deviation of interest using a linear transformation. – Dason Apr 20 '13 at 16:30
  • Also are you just interested in doing this for the Gaussian case or do you want to do it for any distribution in general? – Dason Apr 20 '13 at 16:31
  • @Dason Any distribution, in either language. – PascalVKooten Apr 20 '13 at 22:11

3 Answers


Simply scale your results. In the univariate case:

set.seed(21)
x <- rnorm(1000)
mean(x)            # not exactly 0
sd(x)              # not exactly 1
y <- x - mean(x)   # center: mean is now exactly 0
y <- y / sd(x)     # scale: sd is unchanged by centering, so this makes it exactly 1
mean(y)            # within floating point precision of 0
sd(y)              # 1

The multivariate case is a bit more involved, but possible.
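
For example, a minimal sketch of the multivariate version (whiten with the Cholesky factor of the sample covariance, then re-colour with the Cholesky factor of a target covariance; the 2-variable target matrix here is just an illustration, not from the question):

set.seed(21)
target <- matrix(c(1, 0.2, 0.2, 1), 2, 2)   # desired covariance (example values)
z <- matrix(rnorm(2000), ncol = 2)
z <- sweep(z, 2, colMeans(z))               # exact zero column means
z <- z %*% solve(chol(cov(z))) %*% chol(target)
cov(z)                                      # equals target, up to floating point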

Joshua Ulrich

It sounds like you want mvrnorm in the MASS package.

library(MASS)

sigma <- matrix(c(1.0, 0.0, -0.5,
                  0.0, 1.0,  0.5,
                 -0.5, 0.5,  1.0), 3, byrow = TRUE)
mat <- mvrnorm(10, c(0, 0, 0), sigma, empirical = TRUE)
cor(mat)
cor(mat)
#     [,1]  [,2]  [,3]
#[1,]  1.0   0.0  -0.5
#[2,]  0.0   1.0   0.5
#[3,] -0.5   0.5   1.0

Note that by setting the SD of each variable to 1 I simplified things, because the covariance then equals the correlation. You can generalize this by remembering that the correlation is the covariance divided by the product of the SDs.
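
For instance (a small sketch with made-up SDs, not taken from the question), you can turn the correlation matrix into a covariance matrix with diag(sds) %*% R %*% diag(sds) and still pass it to mvrnorm with empirical = TRUE:

library(MASS)

R <- matrix(c(1.0, 0.0, -0.5,
              0.0, 1.0,  0.5,
             -0.5, 0.5,  1.0), 3, byrow = TRUE)
sds <- c(2, 1, 0.5)                      # example SDs
sigma <- diag(sds) %*% R %*% diag(sds)   # covariance = correlation scaled by the SDs
mat <- mvrnorm(10, rep(0, 3), sigma, empirical = TRUE)
cor(mat)            # exactly R, up to machine accuracy
apply(mat, 2, sd)   # exactly 2, 1, 0.5, up to machine accuracy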

(Note that when you run the code you may not get exactly these values, but values within machine accuracy, which is all we can hope for.)

John

You can simply rescale the data.

n <- 100
x <- rnorm(n)
x <- ( x - mean(x) ) / sd(x)
mean(x)   # 0, up to machine precision
sd(x)     # 1

You could also use ppoints to start from a regular grid of probabilities and take the corresponding normal quantiles (you still have to rescale, though).

x <- qnorm( ppoints(n) )
x <- ( x - mean(x) ) / sd(x)
mean(x)   # 0, up to machine precision
sd(x)     # 1

In higher dimensions, the transformation is a bit trickier. If x is a Gaussian vector with mean zero and the identity matrix as its variance, then C %*% x is Gaussian with zero mean and variance matrix V = C C'. C is the Cholesky factor of V; it can be seen as an analogue of the square root for (symmetric, positive semi-definite) matrices. (In the code below the observations are stored in rows and R's chol returns the upper-triangular factor, which is why the multiplication appears as x %*% C.)

Two such transformations are needed: the first sets the sample variance to the identity, the second sets it to the desired value.

# Desired variance matrix
V <- matrix( c(1,.2,.2, .2,1,.2, .2,.2,1), 3, 3 )

# Random data
n <- 100
k <- 3
x <- matrix( rnorm(k*n), ncol = k )

# Set the mean to 0, and the variance to the identity
x <- t( t(x) - colMeans(x) )
colMeans(x)   # 0
C1 <- chol(var(x))
x <- x %*% solve(C1)
var(x)   # identity matrix

# Set the variance to the desired value
C2 <- chol(V)
x <- x %*% C2
var(x) - V   # zero
Vincent Zoonekynd