3

I struggle with the following task: I need to generate data from a truncated normal distribution. The sample mean and standard deviation should match exactly those specified in the population. This is what I have so far:

    mean <- 100
    sd <- 5
    lower <- 40
    upper <- 120
    n <- 100   

    library(msm)    
    data <- as.numeric(mean+sd*scale(rtnorm(n, lower=40, upper=120)))

The sample that's created takes on exactly the mean and sd specified in the population. But some values exceed the intended bounds. Any idea how to fix this? I was thinking of just cutting off all values outside these bounds, but then mean and sd don't resemble those of the population anymore.

Lion Behrens
  • 199
  • 1
  • 11
  • if you want the values of the first two moments to match exactly you might edit your question to say "match exactly" rather than "resemble exactly"; "resemble" suggests that the values should be match approximately, but need not match precisely – Ben Bolker Jul 29 '17 at 13:41
  • 1
    We should let them respond. There are various reasons one might want to do this (I've certainly seen this question asked for other distributions, although I can't find them right now); I'm going to vote to close as unclear in the meantime ... – Ben Bolker Jul 29 '17 at 13:57
  • Sorry for formulating the question vaguely. @BenBolker you are right, the values in the generated sample should match the values specified for the population exactly! Just using the rtnorm function won't solve this because of the slight deviations that come with sampling from a distribution. – Lion Behrens Jul 29 '17 at 14:04
  • I am aware that this might be perceived as uncommon. – Lion Behrens Jul 29 '17 at 14:05

1 Answers1

4

You could use an iterative answer. Here I add samples one by one to the vector, but only if the resulting scaled dataset remains within the boundaries that you set. It takes longer, but it works:

n <- 10000
mean <- 100
sd <- 15
lower <- 40
upper <- 120

data <- rtnorm(1, lower=((lower - mean)/sd), upper=((upper - mean)/sd))
while (length(data) < n) {
  sample <- rtnorm(1, lower=((lower - mean)/sd), upper=((upper - mean)/sd))
  data_copy = c(data, sample)
  data_copy_scaled = mean + sd * scale(data_copy)
  if (min(data_copy_scaled) >= lower & max(data_copy_scaled) <= upper) {
    data = c(data, sample)
  }
}

scaled_data = as.numeric(mean + sd * scale(data))

summary(scaled_data)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  40.38   91.61  104.35  100.00  111.28  120.00

sd(scaled_data)

15

Below my old answer, which doesn't quite work

How about scaling the lower and upper limits of rtnorm with the mean and sd that you want?

n <- 1000000
mean <- 100
sd <- 5

library(msm)

data <- as.numeric(mean+sd*scale(rtnorm(n, lower=((40 - mean)/sd), upper=((120 - mean)/sd))))

summary(data)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  76.91   96.63  100.00  100.00  103.37  120.00 

sd(data)

5

In this case, even with a sample of 1000000 you get the exact mean and sd, and the max and min values remain within your boundaries.

Oriol Mirosa
  • 2,756
  • 1
  • 13
  • 15
  • @OriolMirosa this edited version of your answer is exactly what I was looking for. Thank you a lot! – Lion Behrens Jul 29 '17 at 14:15
  • Great! I'm glad it worked. Don't forget to mark the answer as correct so others can find it. – Oriol Mirosa Jul 29 '17 at 14:16
  • I'm sorry to say that it turned out it's still not quite doing the job. The proposed code works very well if sd is set to a low value. However, try out e.g. `mean=100, sd=15, n=10000` with a lower boundary of e.g. 40 and an upper boundary of e.g. 120. With a higher standard deviation, the code will still produce values well outside the desired boundaries. – Lion Behrens Jul 29 '17 at 14:52
  • Yes, you are right. However, I'm not sure how you can prevent this. When you sample from a truncated normal, the standard deviation is that of the normal, but since it is truncated, the sample has a much smaller standard deviation. You can check by calculating the sd of `data <- rtnorm(n=10000, mean=100, sd=15, lower=40, upper=120)`. The last time I ran it, the sd was 10.28. Therefore, when you scale the sample so that it has a standard deviation of 15, you are necessarily spacing out the observations, and these are bound to move beyond your original bounds. Does this make sense? – Oriol Mirosa Jul 29 '17 at 18:13
  • Yes, this makes perfectly sense. I definitely need to create a vector of n numbers that exactly matches a pre-specified mean and sd and lies within fixed lower and upper bounds though. Thus, I am still in search for a valid solution to this problem. The proposed code will satisfy the first condition (mean and sd exactly match the pre-specified ones), but not the second one (all data values should lie within fixed bounds). – Lion Behrens Jul 30 '17 at 09:54
  • This is exactly what's in need. Thanks a lot for your efforts! – Lion Behrens Jul 31 '17 at 06:07