2

I'm building my own maximum likelihood estimator that estimates the parameters associated with the mean and standard deviation. On simulated data my function works when the true mean is a linear function and the standard deviation is constant. However, if the mean structure is polynomial my function cannot recover the true parameters. Can anybody point me to a solution?

I'm aware there are plenty of existing functions for estimating means and SDs. I'm not interested in them, I'm interested in why my function is not working.

Below is a reproducible example where my model does not recover the true standard deviation (true sd = 1.648, mysd = 4.184123)

*Edit: added library()

library(tidyverse)
my_poly_loglik <- function(pars, #parameters 
                           outcome, #outcome variable
                           poly_mean){ #data frame of polynomials
  
  #modelling the mean - adding intercept column
  mean_mdl = cbind(1, poly_mean) %*% pars[1:(ncol(poly_mean) + 1)]

  #modelling the standard deviation on exponential scale
  sd_mdl = exp(pars[length(pars)])
  
  #computing log likelihood
  sum_log_likelihood <- sum(dnorm(outcome,
                                  mean = mean_mdl,
                                  sd = sd_mdl, 
                                  log = TRUE), 
                            na.rm = TRUE)
  #since optim() is minimizing we want the -log likelihood
  return(-sum_log_likelihood)
}


#Generate data
set.seed(103)
n <- 100000 #100k obs
z <- runif(n, min = 0.1, max = 40) #independent variable sampled uniformly
mean <- 10 + 0.2 * z + 0.4 * z^2  #mean structure
sd = exp(0.5) #constant SD
y <- rnorm(n,mean, sd)
#Visualizing simulated data
#plot(z,mean) 
#plot(z,sd)
#plot(z,y)
mydf = data.frame(z,y)


#Defining polynomials
polymean = cbind(z, z^2)
#Initial values. 2 extra for mean_intercept and SD
pars = rep(0, ncol(polymean) + 2) 
#Optimising my likelihood function
optim_res <- optim(pars, 
                   fn = my_poly_loglik, 
                   outcome = mydf$y,
                   poly_mean = polymean) 
if (optim_res$convergence != 0) stop("optim_res value is not 0!")


#comparing my function to the real parameter
plot_df = data.frame("mymean" = optim_res$par[1] + (polymean %*% optim_res$par[2:3]),
                     "truemean" = mean,
                     "z" = z)
#my mean (black) and true mean (red)
plot_df %>% 
  ggplot(aes(x = z, y = mymean)) +
  geom_line() +
  geom_line(aes(y = truemean), color = "red")
#Works!

#my SD and true SD - PROBLEM!
sd #true sd
exp(optim_res$par[length(optim_res$par)]) #my sd

WiggyStardust
  • 182
  • 1
  • 10

1 Answers1

2

this is not a complete solution but it might help others find the correct answer.

The code looks good overall and the issue emerges only with a high range of the z values. In fact, scaling them or generating data from a considerably lower range leads to the correct solution. Furthermore, checking the hessian shows that the covariance matrix of the estimates is not positive semidefinite and slightly reducing the range results in correlations of the mean parameters close to 1. (This is a bit puzzling since a normal linear model with the same parametrization does not suffer from the same issue -- I know it does not optimize the likelihood directly, but still a bit unintuitive to me).

So, a temporal solution might be rescaling the predictors / using an orthogonal parametrization? But that does not really explain core of the issue.