1

I notice searching through stackoverflow for similar questions that this has been asked several times hasn't really been properly answered. Perhaps with help from other users this post can be a helpful guide to programming a numerical estimate of the parameters of a multivariate normal distribution.

I know, I know! The closed form solutions are available and trivial to implement. In my case I am interested in modifying the likelihood function for a specific purpose and I don't expect an exact analytic solution so this is a test case to check the procedure.

So here is my attempt. Please comment. Especially if I am missing opportunities for optimization. Note, I'm not a statistician so I'd appreciate any pointers.

ll_multN <- function(theta,X) {

  # theta = c(mu, diag(Sigma), Sigma[upper.tri(Sigma)])
  # X is an nxk dataset

  # MLE: L = - (nk/2)*log(2*pi) - (n/2)*log(det(Sigma)) - (1/2)*sum_i(t(X_i-mu)^2 %*% Sigma^-1 %*% (X_i-mu)^2)
  # summation over i is performed using a apply call for efficiency

  n <- nrow(X)
  k <- ncol(X)

  # def mu
  mu.vec <- theta[1:k]

  # def Sigma
  Sigma.diag <- theta[(k+1):(2*k)]
  Sigma.offd <- theta[(2*k+1):length(theta)]
  Sigma <- matrix(NA, k, k)
  Sigma[upper.tri(Sigma)] <- Sigma.offd
  Sigma <- t(Sigma)
  Sigma[upper.tri(Sigma)] <- Sigma.offd

  diag(Sigma) <- Sigma.diag

  # compute summation 
  sum_i <- sum(apply(X, 1, function(x) (matrix(x,1,k)-mu.vec)%*%solve(Sigma)%*%t(matrix(x,1,k)-mu.vec)))

  # compute log likelihood
  logl <- -.5*n*k*log(2*pi) - .5*n*log(det(Sigma)) 
  logl <- logl - .5*sum_i
  return(-logl)
}

Simulated dataset generated using the rmvnorm() function in the package "mvtnorm". Random positive definite covariance matrix generated using the additional function Posdef() (taken from here: https://stat.ethz.ch/pipermail/r-help/2008-February/153708)

library(mvtnorm)

Posdef <- function (n, ev = runif(n, 0, 5)) { 
  # generates a random positive definite covariance matrix
  Z <- matrix(ncol=n, rnorm(n^2))
  decomp <- qr(Z)
  Q <- qr.Q(decomp) 
  R <- qr.R(decomp)
  d <- diag(R)
  ph <- d / abs(d)
  O <- Q %*% diag(ph)
  Z <- t(O) %*% diag(ev) %*% O
  return(Z)
}

set.seed(2)

n <- 1000 # number of data points
k <- 3 # number of variables
mu.tru <- sample(0:3, k, replace=T) # random mean vector
Sigma.tru <- Posdef(k) # random covariance matrix
eigen(Sigma.tru)$val # check positive def (all lambda > 0)

# Generate simulated dataset
X <- rmvnorm(n, mean=mu.tru, sigma=Sigma.tru)

# initial parameter values
pars.init <- c(mu=rep(0,k), sig_ii=rep(1,k), sig_ij=rep(0, k*(k-1)/2))

# limits for optimization algorithm
eps <- .Machine$double.eps  # get a small value for bounding the paramter space to avoid things such as log(0).
lower.bound <- c(rep(-Inf,k), # bound on mu
                 rep(eps,k), # bound on sigma_ii
                 rep(-Inf,k)) # bound on sigma_ij i=/=j
upper.bound <- c(rep(Inf,k), # bound on mu
                 rep(100,k), # bound on sigma_ii
                 rep(100,k)) # bound on sigma_ij i=/=j

system.time(
  o <- optim(pars.init,
             ll_multN, X=X, method="L-BFGS-B",
             lower = lower.bound,
             upper = upper.bound)  
)

plot(x=c(mu.tru,diag(Sigma.tru),Sigma.tru[upper.tri(Sigma.tru)]),
     y=o$par,
     xlab="Parameter",
     ylab="Estimate",
     pch=20)
abline(c(0,1), col="red", lty=2)

This currently runs on my laptop in

   user  system elapsed 
 47.852  24.014  24.611

and gives this graphical output: Estimated mean and variance

In particular any advice on limit setting or algorithm choice would be much appreciated.

Thanks

jlouis
  • 11
  • 3
  • I noticed that I was miss-defining Sigma in the function `ll_multN`; `Sigma[upper.tri(Sigma)] <- Sigma[lower.tri(Sigma)] <- Sigma.offd` is non-symmetric due to R indexing norms. Still looking for any insights on optimization, limit setting or algorithm choice...! – jlouis Jun 08 '20 at 11:38

0 Answers0