
I found this question online. Can someone explain in detail why using OLS is better? Is it only because the number of samples is not enough? Also, why not use all 1000 samples to estimate the prior distribution?

We have 1000 randomly sampled data points. The goal is to build a regression model with one response variable from k regressor variables. Which is better?

1. (Bayesian Regression) Use the first 500 samples to estimate the parameters of an assumed prior distribution, then use the last 500 samples to update the prior to a posterior distribution, with the posterior estimates used in the final regression model.
2. (OLS Regression) Use a simple ordinary least squares regression model with all 1000 data points.

Erin

2 Answers


"Better" is always a matter of opinion, and it greatly depends on context.

Advantages of a frequentist OLS approach: simpler, faster, and more accessible to a wider audience (and therefore less to explain). A wise professor of mine used to say, "You don't need to build an atom smasher when a flyswatter will do the trick."

Advantages of an equivalent Bayesian approach: more flexibility for further model development, and the ability to directly model posteriors of derived/calculated quantities (a small sketch of this appears after the Bayesian fit below). There are more, but these have been my motivations for going Bayesian with a given analysis. Note the word "equivalent" - there are things you can do in a Bayesian framework that you can't do within a frequentist approach.

And hey, here's an exploration in R, first simulating data, then using a typical OLS approach.

N <- 1000
x <- 1:N
epsilon <- rnorm(N, 0, 1)
y <- x + epsilon

summary(lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9053 -0.6723  0.0116  0.6937  3.7880 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 0.0573955  0.0641910    0.894    0.371    
## x           0.9999997  0.0001111 9000.996   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 1.014 on 998 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.102e+07 on 1 and 998 DF,  p-value: < 2.2e-16

...and here's an equivalent Bayesian regression, using non-informative priors on the regression parameters and all 1000 data points.

library(R2jags)
cat('model {
  for (i in 1:N){
    y[i] ~ dnorm(y.hat[i], tau)    # likelihood (JAGS parameterizes the normal by precision)
    y.hat[i] <- a + b * x[i]       # linear predictor
  }
  a ~ dnorm(0, .0001)              # vague (non-informative) prior on the intercept
  b ~ dnorm(0, .0001)              # vague prior on the slope
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 100)            # vague prior on the residual sd
}', file="test.jags")

test.data <- list(x=x,y=y,N=1000)
test.jags.out <- jags(model.file="test.jags", data=test.data, 
                  parameters.to.save=c("a","b","tau","sigma"), n.chains=3, n.iter=10000)
test.jags.out$BUGSoutput$mean$a
## [1] 0.05842661
test.jags.out$BUGSoutput$sd$a
## [1] 0.06606705
test.jags.out$BUGSoutput$mean$b
## [1] 0.9999976
test.jags.out$BUGSoutput$sd$b
## [1] 0.0001122533

Note that the parameter estimates and standard errors/standard deviations are essentially equivalent!
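
As a small illustration of the "posteriors of derived/calculated quantities" point above, here's a sketch of how that looks in practice. This is my own addition rather than one of the original runs, so no output is shown, and the node name x50 and file name test_derived.jags are just placeholders: the model monitors the covariate value at which the expected response equals 50, as a full posterior rather than a plugged-in point estimate.

# same model as before, plus one deterministic node for a derived quantity
cat('model {
  for (i in 1:N){
    y[i] ~ dnorm(y.hat[i], tau)
    y.hat[i] <- a + b * x[i]
  }
  a ~ dnorm(0, .0001)
  b ~ dnorm(0, .0001)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 100)
  x50 <- (50 - a) / b    # x at which the expected response equals 50, monitored directly
}', file="test_derived.jags")

derived.out <- jags(model.file="test_derived.jags", data=test.data,
                    parameters.to.save=c("a","b","sigma","x50"), n.chains=3, n.iter=10000)
derived.out$BUGSoutput$mean$x50   # posterior mean of the derived quantity
derived.out$BUGSoutput$sd$x50     # and its posterior sd

Getting an uncertainty statement for a nonlinear quantity like that from the frequentist fit would typically take a delta-method argument; here it is just another monitored node.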

Now here's another Bayesian regression, using the first 500 data points to estimate the priors and then the last 500 to estimate posteriors.

test.data <- list(x=x[1:500],y=y[1:500],N=500)
test.jags.out <- jags(model.file="test.jags", data=test.data, 
                  parameters.to.save=c("a","b","tau","sigma"), n.chains=3, n.iter=10000)

cat('model {
  for (i in 1:N){
    y[i] ~ dnorm(y.hat[i], tau)
    y.hat[i] <- a + b * x[i]
  }
  a ~ dnorm(a_mn, a_prec)    # informative priors: means and sds supplied as data,
  b ~ dnorm(b_mn, b_prec)    # taken from the posterior of the first-500 fit
  a_prec <- pow(a_sd, -2)    # convert sds to precisions
  b_prec <- pow(b_sd, -2)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 100)      # residual sd still gets a vague prior
}', file="test.jags1")

test.data1 <- list(x=x[501:1000],y=y[501:1000],N=500,
                   a_mn=test.jags.out$BUGSoutput$mean$a,a_sd=test.jags.out$BUGSoutput$sd$a,
                   b_mn=test.jags.out$BUGSoutput$mean$b,b_sd=test.jags.out$BUGSoutput$sd$b)
test.jags.out1 <- jags(model.file="test.jags1", data=test.data1, 
                  parameters.to.save=c("a","b","tau","sigma"), n.chains=3, n.iter=10000)

test.jags.out1$BUGSoutput$mean$a
## [1] 0.01491162
test.jags.out1$BUGSoutput$sd$a
## [1] 0.08513474
test.jags.out1$BUGSoutput$mean$b
## [1] 1.000054
test.jags.out1$BUGSoutput$sd$b
## [1] 0.0001201778

Interestingly, the inferences are similar to the OLS results, but not nearly as close as before. This leads me to suspect that the 500 data points used to train the prior are not carrying as much weight in the analysis as the last 500, and that the prior is effectively getting washed out, though I'm not sure on this point.
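
One rough way to probe that suspicion is to refit the last 500 points on their own, under the original non-informative priors, and compare posterior spreads with the 500+500 fit. A sketch (again my own addition, not one of the original runs, so no output is shown; the names test.data2 and test.jags.out2 are just placeholders):

# Fit only the last 500 points with the original vague-prior model.
# If the prior trained on the first 500 were completely washed out, this
# should look almost identical to the 500+500 result; if that prior were
# carrying real weight, the 500+500 posterior sds should be noticeably smaller.
test.data2 <- list(x=x[501:1000], y=y[501:1000], N=500)
test.jags.out2 <- jags(model.file="test.jags", data=test.data2,
                  parameters.to.save=c("a","b","tau","sigma"), n.chains=3, n.iter=10000)
test.jags.out2$BUGSoutput$sd$a
test.jags.out2$BUGSoutput$sd$b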

Regardless, I can't think of a reason not to use all 1000 data points (and non-informative priors) either, particularly since I suspect the 500+500 is using the first 500 and last 500 differently.

So perhaps the answer to all of this is: I trust the OLS and 1000-point Bayesian results more than the 500+500, and OLS is simpler.

Matt Tyers

In my opinion it is not a matter of which is better, but a matter of which inference approach you're comfortable with.

You must remember that OLS comes from the frequentist school of inference, and estimation is done via a maximum likelihood (ML) process which, for this particular problem, coincides with a geometric argument of distance minimization (in my personal opinion this is very odd, as supposedly we are dealing with a random phenomenon).
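
Concretely (in my notation), with normally distributed errors the ML estimate and the distance-minimization argument give the same thing:

$$\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \lVert y - X\beta \rVert^2 = (X^\top X)^{-1} X^\top y.$$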

On the other hand, in the Bayesian approach, inference is done through the posterior distribution, which is proportional to the product of the prior (representing the decision maker's previous information about the phenomenon) and the likelihood.
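
In symbols (again, my notation):

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta),$$

where $p(\theta)$ is the prior and $p(y \mid \theta)$ is the likelihood.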

Again, the question is a matter of what inference approach you're comfortable with.