0

I am fitting a model (very much simplified for reproducible) like so:

library(datasets) library(rstanarm)

set.seed(42)
rm(list = ls(all = TRUE))

prediction_data1 <- data.frame(
        Petal.Length = 1.4
    )

prediction_data2 <- data.frame(
        Petal.Length = 1.4
    )

model <- stan_glm(
        Petal.Width ~ Petal.Length
        , data = iris
        , chains = 3
        , iter = 1000
        , warmup = 100
)

new_predictions1 <- as.data.frame(posterior_predict(model, newdata = prediction_data1))
new_predictions2 <- as.data.frame(posterior_predict(model, newdata = prediction_data2))

colnames(new_predictions1) <- c('Petal.Width')
colnames(new_predictions2) <- c('Petal.Width')

median(new_predictions1$Petal.Width)
median(new_predictions2$Petal.Width)

I guess I should not expect the same median for both equal 'simulation' datasets (e.g. 0.2216209 and 0.2177802)? However is the above skeleton code a correct approach to simulate different scenarios ( e.g. Petal.Length = 1.4 versus Petal.Length = 2.4)?

cs0815
  • 16,751
  • 45
  • 136
  • 299
  • Are both predictions supposed to be 1.4, since you mention it as 1.4 vs 2.4? – camille Nov 04 '18 at 17:44
  • Yes in the case of the example showing that despite them being equal the median of the 2 samples are different. Maybe this is to be expected? For actual simulation the values would be different. – cs0815 Nov 04 '18 at 18:14
  • Honestly don't know, I was just checking you hadn't made a typo since you mentioned 2.4 – camille Nov 04 '18 at 18:16
  • I have not used rstanarm and can't say anything about your skeleton code. However, it's completely realistic for two MCMC runs to yield slightly different medians, particularly with only 1000 iterations. I'd recommend at least 10k or even 100k if you want precision, and you'll notice better agreement between runs as well. – Matt Tyers Nov 05 '18 at 19:54
  • Thanks. I guess you mean the posterior predictions will be more precise? – cs0815 Nov 05 '18 at 20:07
  • If you obtain more draws from the posterior distribution, then yes the difference in medians between runs will be smaller. But it is not generally advisable to get 10k or 100k draws with Stan like you might with other MCMC samplers that are positively autocorrelated. With Stan, the first order autocorrelation can be negative in which case the effective sample size can be greater than the nominal number of draws. Anyway, it is the inverse square root of the effective sample size that the Monte Carlo error is proportional to, not the nominal number of draws. Most people don't need .01 precision. – Ben Goodrich Nov 17 '18 at 20:30

0 Answers0