Extract and add to the data values of the probability density function based on a stan linear model

Question

Given the sample data sampleDT and models lm.fit and brm.fit below, I would like to:

estimate, extract and add to the data frame the values of the density function for a conditional normal distribution evaluated at the observed level of the variable dollar.wage_1.

I can do this using a frequentist linear regression (lm.fit) and dnorm, but my attempt to do the same with a Bayesian brm.fit model fails. Any help would be much appreciated.

##sample data

sampleDT<-structure(list(id = 1:10, N = c(10L, 10L, 10L, 10L, 10L, 10L, 
    10L, 10L, 10L, 10L), A = c(62L, 96L, 17L, 41L, 212L, 143L, 143L, 
    143L, 73L, 73L), B = c(3L, 1L, 0L, 2L, 170L, 21L, 0L, 33L, 62L, 
    17L), C = c(0.05, 0.01, 0, 0.05, 0.8, 0.15, 0, 0.23, 0.85, 0.23
    ), employer = c(1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L), F = c(0L, 
    0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L), G = c(1.94, 1.19, 1.16, 
    1.16, 1.13, 1.13, 1.13, 1.13, 1.12, 1.12), H = c(0.14, 0.24, 
    0.28, 0.28, 0.21, 0.12, 0.17, 0.07, 0.14, 0.12), dollar.wage_1 = c(1.94, 
    1.19, 3.16, 3.16, 1.13, 1.13, 2.13, 1.13, 1.12, 1.12), dollar.wage_2 = c(1.93, 
    1.18, 3.15, 3.15, 1.12, 1.12, 2.12, 1.12, 1.11, 1.11), dollar.wage_3 = c(1.95, 
    1.19, 3.16, 3.16, 1.14, 1.13, 2.13, 1.13, 1.13, 1.13), dollar.wage_4 = c(1.94, 
    1.18, 3.16, 3.16, 1.13, 1.13, 2.13, 1.13, 1.12, 1.12), dollar.wage_5 = c(1.94, 
    1.19, 3.16, 3.16, 1.14, 1.13, 2.13, 1.13, 1.12, 1.12), dollar.wage_6 = c(1.94, 
    1.18, 3.16, 3.16, 1.13, 1.13, 2.13, 1.13, 1.12, 1.12), dollar.wage_7 = c(1.94, 
    1.19, 3.16, 3.16, 1.14, 1.13, 2.13, 1.13, 1.12, 1.12), dollar.wage_8 = c(1.94, 
    1.19, 3.16, 3.16, 1.13, 1.13, 2.13, 1.13, 1.12, 1.12), dollar.wage_9 = c(1.94, 
    1.19, 3.16, 3.16, 1.13, 1.13, 2.13, 1.13, 1.12, 1.12), dollar.wage_10 = c(1.94, 
    1.19, 3.16, 3.16, 1.13, 1.13, 2.13, 1.13, 1.12, 1.12)), row.names = c(NA, 
    -10L), class = "data.frame")

##frequentist model: this works

lm.fit <-lm(dollar.wage_1 ~ A + B + C + employer + F + G + H,
            data=sampleDT)

sampleDT$dens1 <- dnorm(sampleDT$dollar.wage_1,
                        mean = fitted(lm.fit),
                        sd = summary(lm.fit)$sigma)

##bayesian model: this is my attempt - it does not work

library(brms)  # provides brm()

# this works
brm.fit <-brm(dollar.wage_1 ~ A + B + C + employer + F + G + H,
            data=sampleDT, iter = 4000, family = gaussian())

# this does not work
sampleDT$dens1_bayes <- dnorm(sampleDT$dollar.wage_1, mean = fitted(brm.fit), sd = summary(brm.fit)$sigma)

Error in dnorm(sampleDT$dollar.wage_1, mean = brm.fit$fitted, sd = summary(brm.fit)$sigma) : Non-numeric argument to mathematical function

Thanks in advance for any help.

score 1 · Accepted Answer · answered Feb 08 '19 at 17:36


Unlike the fitted values of lm.fit, fitted(brm.fit) is a matrix, so we want to use only its first column, the one holding the estimates. Also, there is no reason for a brms summary to have the same structure as an lm summary, and summary(brm.fit)$sigma gives nothing. Instead you want summary(brm.fit)$spec_pars[1]. Hence, you may use

sampleDT$dens1_bayes <- dnorm(sampleDT$dollar.wage_1,
                              mean = fitted(brm.fit)[, 1],
                              sd = summary(brm.fit)$spec_pars[1])
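
The two problems can be reproduced without fitting anything. The following is only a sketch: fit_mat is a hand-built stand-in for what fitted(brm.fit) returns, and its column names, the values in y and sigma_hat are all made up for illustration, not output from the model above.

```r
# Hypothetical stand-in for fitted(brm.fit): one row per observation,
# one column per posterior summary (names and numbers are assumptions).
fit_mat <- cbind(Estimate  = c(1.9, 1.2, 3.1),
                 Est.Error = c(0.4, 0.5, 0.4))
y <- c(1.94, 1.19, 3.16)   # illustrative observed responses
sigma_hat <- 0.75          # illustrative residual sd

# Problem 1: a missing list element is NULL, and dnorm() rejects sd = NULL.
res <- tryCatch(dnorm(y, mean = fit_mat[, 1], sd = NULL),
                error = function(e) e)
inherits(res, "error")     # TRUE

# Problem 2: passing the whole matrix recycles y over all 6 cells,
# so the result no longer lines up with the 3 observations.
dens_wrong <- dnorm(y, mean = fit_mat, sd = sigma_hat)
length(dens_wrong)         # 6

# Selecting the estimate column gives one density per observation.
dens_ok <- dnorm(y, mean = fit_mat[, 1], sd = sigma_hat)
length(dens_ok)            # 3
```

Only the last call has the shape needed for a new data frame column.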


Julius Vainora


Great, @JuliusVainora. Thanks for the answer. Very helpful. But I am a bit concerned: why are the sampleDT$dens1_bayes values so different from sampleDT$dens1? I noticed that `> sampleDT$dens1_freq [1] 0.5313967 0.4377899 0.5309715 0.4308041 0.5297744 0.5247409 0.5275020 0.4069652 0.5295822 0.3930264` whereas `> sampleDT$dens1_bayes [1] 0.1644518 0.1613566 0.1644267 0.1621689 0.1644273 0.1641519 0.1642465 0.1591944 0.1642170 0.1601089`. Shouldn't the two be approximately equal, or at least not this different? – Krantz Feb 08 '19 at 17:52
@Krantz, the difference comes from `sd`, which is roughly three times as high in the Bayesian model. I'm not entirely sure how to read the prior of `sigma` in `brm.fit$prior` (the 0 and 10 part), but 3 degrees of freedom for a t distribution combined with a very small sample can mean that posterior uncertainty will remain high. The only other explanation would be that this `sigma` is some entirely different parameter, but I doubt that. – Julius Vainora Feb 08 '19 at 17:59
Yes. As you say, one possibility is the `very small sample` and the other is `the prior of sigma in brm.fit$prior (...) 3 degrees of freedom for a t distribution`. The `sd` values are extremely different: `DT$sd_freq [1] 0.7506886` whereas `sampleDT$sd_bayes [1] 2.425812`. The data are the same, so the Bayesian and frequentist results should not differ this much when using the `default configurations` of the packages. Any thoughts? – Krantz Feb 08 '19 at 18:19
@Krantz, I'd suggest generating some data and estimating a model on a small subsample and on the full (relatively large) sample, then comparing the resulting average `sd` with the true one. – Julius Vainora Feb 08 '19 at 18:24
Thanks, @JuliusVainora. I will do that. – Krantz Feb 08 '19 at 18:25
Hi, @JuliusVainora. A related question has been posted at https://stackoverflow.com/questions/54615821/extract-and-add-to-the-data-frame-the-values-of-sigma-from-a-stan-distributional. Thanks in advance for any help. – Krantz Feb 10 '19 at 11:19
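
The simulation suggested in the comments could be sketched as follows for the frequentist side only. Everything here is an assumption made for illustration: the true sd of 1, zero true slopes, 500 replications, and the helper name sigma_hat. The Bayesian side would need the same loop around brm with the extraction shown in the answer, which takes much longer to run.

```r
set.seed(1)
true_sigma <- 1   # assumed true residual sd
n_pred <- 7       # same number of predictors as in the question

# Hypothetical helper: simulate n rows, fit lm, return the residual sd estimate.
sigma_hat <- function(n) {
  X <- matrix(rnorm(n * n_pred), nrow = n)
  y <- rnorm(n, mean = 1, sd = true_sigma)   # true slopes are zero here
  summary(lm(y ~ X))$sigma
}

small <- replicate(500, sigma_hat(10))     # 10 rows, as in sampleDT
large <- replicate(500, sigma_hat(1000))   # a relatively large sample

# Average estimate and its spread across replications in each setting
round(c(mean_small = mean(small), sd_small = sd(small),
        mean_large = mean(large), sd_large = sd(large)), 3)
```

With 10 rows and 8 coefficients only 2 residual degrees of freedom remain, so the small-sample estimates scatter widely around the true value, while the large-sample estimates stay close to it.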
