
I have written a function which evaluates the Laplace approximate marginal likelihood of complex mixed models. The implementation uses sparse matrices via Eigen/RcppEigen and automatic differentiation via the autodiff C++ library. It is available in the development version of the galamm package, which can be installed with the code below. The installation requires compilation of C++ code, but the package passes R CMD check, so it should be easy to install for anyone with R set up for building packages.

remotes::install_github("LCBC-UiO/galamm", 
ref = "7b0b521bb6a9aeb2c93da39dc7a07fef61d81211")

In the package I have a function marginal_likelihood which evaluates the marginal likelihood of the model at given values of the parameters. The function returns a list containing two elements: (1) the deviance of the model, and (2) the gradient with respect to the parameters.

Although the function is intended for latent variable models used in psychometrics, for the purpose of this question it is sufficient to consider a linear mixed model. In the code directly below, I confirm that the deviance is identical to that computed by lme4::lmer, using an example dataset from lme4. It should also be clear that marginal_likelihood is intended to be a low-level function, not directly exposed to users.

library(lme4)
#> Loading required package: Matrix
library(galamm)

# Fit a linear mixed model using lme4's modular fitting functions
lmod <- lFormula(Reaction ~ Days + (Days|Subject), sleepstudy, REML = FALSE)
devfun <- do.call(mkLmerDevfun, lmod)
opt <- optimizeLmer(devfun)
mod1 <- mkMerMod(environment(devfun), opt, lmod$reTrms, fr = lmod$fr)

# Confirm that galamm::marginal_likelihood returns the correct value at the 
# maximum likelihood estimates
mod2 <- marginal_likelihood(
  y = sleepstudy$Reaction,
  trials = 0, # not used here
  X = lmod$X,
  Zt = lmod$reTrms$Zt,
  Lambdat = lmod$reTrms$Lambdat,
  beta = fixef(mod1),
  theta = getME(mod1, "theta"),
  theta_mapping = lmod$reTrms$Lind - 1L,
  lambda = numeric(), # factor loadings, not used here
  lambda_mapping_X = integer(),
  lambda_mapping_Zt = integer(),
  family = "gaussian"
)


-2 * logLik(mod1)
#> 'log Lik.' 1751.939 (df=6)
mod2$deviance
#> [1] 1751.939

# marginal_likelihood also returns the gradient wrt theta, beta, and lambda
mod2
#> $deviance
#> [1] 1751.939
#> 
#> $gradient
#> [1] -1.392026e-03 -1.483929e-03 -1.407926e-03  1.327937e-14 -1.724646e-13

Created on 2022-08-18 with reprex v2.0.2

What I now want to do is use the two elements of the list returned by marginal_likelihood, i.e., $deviance and $gradient, to find marginal maximum likelihood estimates using optim(). Since the gradient is computed by automatic differentiation in C++, it is wasteful to recompute it in a second call to the function. Hence, I have implemented the following "hack", which works up to a point:

library(lme4)
#> Loading required package: Matrix
library(galamm)
lmod <- lFormula(Reaction ~ Days + (Days|Subject), sleepstudy, REML = FALSE)

theta_inds <- seq(from = 1, to = length(lmod$reTrms$theta))
beta_inds <- seq(from = max(theta_inds) + 1, length.out = ncol(lmod$X))
bounds <- c(lmod$reTrms$lower, rep(-Inf, length(beta_inds)))

optfun <- function(par){
  ml <- marginal_likelihood(
    y = sleepstudy$Reaction,
    trials = 0, # not used here
    X = lmod$X,
    Zt = lmod$reTrms$Zt,
    Lambdat = lmod$reTrms$Lambdat,
    beta = par[beta_inds],
    theta = par[theta_inds],
    theta_mapping = lmod$reTrms$Lind - 1L,
    lambda = numeric(), # factor loadings, not used here
    lambda_mapping_X = integer(),
    lambda_mapping_Zt = integer(),
    family = "gaussian"
  )
  # Update gradient globally
  grad <<- ml$gradient
  ml$deviance
}

# Simply take the gradient from the global environment
gradfun <- function(par){
  grad
}

# Set gradient to NULL
grad <- NULL
par_init <- c(lmod$reTrms$theta, rep(1, length(beta_inds)))
opt <- optim(par_init, fn = optfun, gr = gradfun, lower = bounds,
             method = "L-BFGS-B")
opt
#> $par
#> [1]   0.92992479   0.01811702   0.22274704 251.40024419  10.46528199
#> 
#> $value
#> [1] 1751.939
#> 
#> $counts
#> function gradient 
#>      116      116 
#> 
#> $convergence
#> [1] 0
#> 
#> $message
#> [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

# Confirm that estimates are identical to lme4:
mod1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy, REML = FALSE)

plot(getME(mod1, "theta"), opt$par[theta_inds]); abline(0, 1)

plot(fixef(mod1), opt$par[beta_inds]); abline(0, 1)

Created on 2022-08-18 with reprex v2.0.2

It does converge to the right solution, but my approach has several problems, including:

  • It uses global assignment to update the gradient. A closure, as sketched after this list, would keep that state local, although it would not fix the next problem.
  • It depends on the fact that L-BFGS-B computes the gradient immediately after evaluating the objective, at the same parameter value. For this reason, if I set hessian = TRUE in the call to optim(), I get a matrix of zeros: when the Hessian is evaluated at the end of the iterations, this ordering no longer holds.
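
Here is a minimal sketch of the closure idea, assuming lmod, theta_inds, and beta_inds defined as above (the make_objective() helper is made up for illustration, not part of galamm). It keeps the cached gradient out of the global environment, but still relies on the evaluation order described in the second bullet:

make_objective <- function() {
  grad <- NULL # cached gradient, local to the closure
  fn <- function(par) {
    ml <- marginal_likelihood(
      y = sleepstudy$Reaction,
      trials = 0, # not used here
      X = lmod$X,
      Zt = lmod$reTrms$Zt,
      Lambdat = lmod$reTrms$Lambdat,
      beta = par[beta_inds],
      theta = par[theta_inds],
      theta_mapping = lmod$reTrms$Lind - 1L,
      lambda = numeric(), # factor loadings, not used here
      lambda_mapping_X = integer(),
      lambda_mapping_Zt = integer(),
      family = "gaussian"
    )
    grad <<- ml$gradient # assigned in the enclosing environment, not globally
    ml$deviance
  }
  gr <- function(par) grad
  list(fn = fn, gr = gr)
}

obj <- make_objective()
# opt <- optim(par_init, fn = obj$fn, gr = obj$gr, lower = bounds,
#              method = "L-BFGS-B")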

My question is thus: is there a nicer way of minimizing a function using optim(), or alternatives like optimx() or optimr(), when the function being optimized returns both the function value AND the gradient at the given parameters?


1 Answer


I figured out that one way of answering my question is memoization, using the memoise package:

library(lme4)
#> Loading required package: Matrix
library(galamm)
library(memoise)
lmod <- lFormula(Reaction ~ Days + (Days|Subject), sleepstudy, REML = FALSE)

theta_inds <- seq(from = 1, to = length(lmod$reTrms$theta))
beta_inds <- seq(from = max(theta_inds) + 1, length.out = ncol(lmod$X))
bounds <- c(lmod$reTrms$lower, rep(-Inf, length(beta_inds)))

mlwrapper <- function(par){
  marginal_likelihood(
    y = sleepstudy$Reaction,
    trials = 0, # not used here
    X = lmod$X,
    Zt = lmod$reTrms$Zt,
    Lambdat = lmod$reTrms$Lambdat,
    beta = par[beta_inds],
    theta = par[theta_inds],
    theta_mapping = lmod$reTrms$Lind - 1L,
    lambda = numeric(), # factor loadings, not used here
    lambda_mapping_X = integer(),
    lambda_mapping_Zt = integer(),
    family = "gaussian"
  )
}

mlmem <- memoise(mlwrapper)
fn <- function(par){
  mlmem(par)$deviance
}
gr <- function(par){
  mlmem(par)$gradient
}

par_init <- c(lmod$reTrms$theta, rep(1, length(beta_inds)))
opt <- optim(par_init, fn = fn, gr = gr, 
             method = "L-BFGS-B", lower = bounds)
opt
#> $par
#> [1]   0.92925063   0.01814757   0.22265438 251.40567346  10.46733073
#> 
#> $value
#> [1] 1751.939
#> 
#> $counts
#> function gradient 
#>      107      107 
#> 
#> $convergence
#> [1] 0
#> 
#> $message
#> [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

# Confirm that estimates are identical to lme4:
mod1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy, REML = FALSE)

plot(getME(mod1, "theta"), opt$par[theta_inds]); abline(0, 1)

plot(fixef(mod1), opt$par[beta_inds]); abline(0, 1)

Created on 2022-08-18 with reprex v2.0.2
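
This works because memoise caches return values keyed on the function arguments: when optim() calls gr(par) with the same par at which fn(par) was just evaluated, mlmem(par) returns the cached list instead of re-running the C++ code. It should also avoid the Hessian problem described above, since gr(par) computes a fresh, correct gradient even for a par that fn has not just seen. A toy illustration of the caching behaviour (slow_square is a made-up stand-in for an expensive function):

library(memoise)
slow_square <- function(x) {
  Sys.sleep(1) # stand-in for an expensive computation
  x^2
}
fast_square <- memoise(slow_square)
system.time(fast_square(2)) # first call: computed, takes about a second
system.time(fast_square(2)) # same argument: served from the cache, nearly instant

Note that the cache grows over the course of the optimization; memoise::forget(mlmem) clears it when done.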
