1

Context of application
I have a model with random slopes and intercepts. There are numerous levels of the random effects. The new data (to be predicted) may or may not have all of these levels.

To make this more concrete, I am working with music revenue at the album level (title). Each album may come in multiple types format2 (CD, vinyl, e-audio, etc). I have measurements for revenue for each album at each type of album. The model is specified as:

lmer(physical~ format2+ (0+format2|title))

The problem is that future data may not have all the levels of either title or format2. For random intercepts, this is easily resolved with predict(..., allow.new.levels= TRUE). But it is problematic for the fixed effects and random slopes. I am therefore trying to write a function to do flexible predictions of merMod objects, similar to lme4::predict.merMod; but that will handle the differences between the training data and the prediction data. This is a question asked as much out of ignorance to the exact details of lme4::predict.merMod as anything else.

Description of problem
The crux of the problem is getting the correct model.matrix() with fixed and random effects to calculate both predictions and SE's. The S3 method for class merMod returns only the fixed effects.

The base stats::model.matrix() function has very limited documentation. Unfortunately, I do not own either Statistical Models in S or Software for Data Analysis, which appear to have the details behind these functions.

model.matrix() is supposed to take a model formula and new data frame and produce a design matrix. But I'm getting an error. Any help you can provide would be much appreciated.

Example Data

dat1 <- structure(list(dt_scale = c(16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16), title = c("Bahia", 
"Jazz Moods: Brazilian Romance", "Quintessence", "Amadeus: The Complete Soundtrack Recording (Bicentennial Edition)", 
"Live In Europe", "We'll Play The Blues For You", "The Complete Village Vanguard Recordings, 1961", 
"The Isaac Hayes Movement", "Jazz Moods: Jazz At Week's End", 
"Blue In Green: The Concert In Canada", "The English Patient - Original Motion Picture Soundtrack", 
"The Unique Thelonious Monk", "Since We Met", "You're Gonna Hear From Me", 
"The Colors Of Latin Jazz: Cubop!", "The Colors Of Latin Jazz: Samba!", 
"Homecoming", "Consecration: The Final Recordings Part 2 - Live At Keystone Korner, September 1980", "More Creedence Gold", "The Stardust Session"), format2 = c("CD", "CD", 
"CD", "CD", "CD", "CD", "CD", "SuperAudio", "SuperAudio", "CD", "E Audio", "CD", 
"Vinyl", "CD", "E Audio", "CD", "CD", "CD", "CD", "CD"), mf_day = c(TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), xmas = c(FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE), vday = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE), yr_since_rel = c(16.9050969937038, 
8.41815617876864, 9.2991404674865, 25.0870296783559, 39.1267038232812, 
27.9156764326061, 9.11596751812513, 23.3052837112449, 14.3123922258974, 
30.5208152866414, 5.83025071417496, 21.3090003877291, 7.75022155568392, 
11.3601605287827, 0.849006673421519, 31.9918631305662, 13.8861905547041, 
12.8342695062012, 29.6916671402534, 13.5912612705038), physical = c(1327.17849171096, 
-110.2265302258, -795.37376268564, 355.06192702004, -1357.3492884345, 
-1254.93442612023, -816.713683621225, 881.201935773452, -3092.02845691036, 
-2268.6304275652, 907.347941142021, -699.130275178185, 377.867849132077, 
-1047.50531157311, 1460.25978951805, 1376.84579069304, 3619.03629114089, 
962.888173535704, 2514.77880599199, 2539.14958588771)), .Names = c("dt_scale", 
"title", "format2", "mf_day", "xmas", "vday", "yr_since_rel", 
"physical"), row.names = c(1L, 2L, 5L, 6L, 7L, 8L, 9L, 11L, 12L, 
13L, 14L, 15L, 20L, 22L, 23L, 25L, 27L, 32L, 35L, 36L), class = "data.frame")

formula:

f1 <- as.formula(~1 + dt_scale + yr_since_rel + format2 + (0 + format2 + mf_day + 
xmas + vday | title))

execution / error

library(lme4)
model.matrix(f1, data= dat1)
Error in 0 + format2 : non-numeric argument to binary operator

Note I've also tried this with the Orthodont data; but, I get a different error.

library(lme4)
data("Orthodont",package="MEMSS")
fm1 <- lmer(formula = distance ~ age*Sex + (1+age|Subject), data = Orthodont)
newdat <- expand.grid(
  age=c(8,10,12,14)
  , Sex=c("Male","Female")
  , distance = 0
  , Subject= c("F01", "F02")
)


f1 <- formula(fm1)[-2] # simpler code via Ben Bolker below
mm <- model.matrix(f1, newdat) # attempt to use model.matrix
Warning message
In Ops.factor(1 + age, Subject) : | not meaningful for factors

# use lme4:::mkNewReTrms as suggested in comments
mm <- lme4:::mkNewReTrms(f1, newdat) 
Error in lme4:::mkNewReTrms(f1, newdat) : object 'ReTrms' not found
In addition: Warning message:
In Ops.factor(1 + age, Subject) : | not meaningful for factors

# check if different syntax would fix this
mm <- lme4::mkNewReTrms(f1, newdat)
Error: 'mkNewReTrms' is not an exported object from 'namespace:lme4'
mm <- mkNewReTrms(f1, newdat)
Error: could not find function "mkNewReTrms"
alexwhitworth
  • 4,839
  • 5
  • 32
  • 59
  • I have a few questions/comments. (1) the sample data you include only has a single value of `format2`, so the model as specified doesn't work (presumably your real data has more). (2) your example continues non-reproducibly; what are `b1` and `a`? (3) your `f1` formula looks suspicious; do *all* of those effects vary within levels of `title`, and do you have enough data to estimate the (correlated) among-title variability in all of them? (4) **missing** levels of either a fixed or a random effect level in new data for prediction aren't a problem; it's **extra** levels that are troublesome – Ben Bolker Aug 27 '14 at 23:27
  • if you want to construct a new random-effects model matrix, you can use `lme4:::mkNewReTrms(object,newdata,re.form)` where `object` is a formula; then extract and transpose the `$Zt` component of the resulting object – Ben Bolker Aug 27 '14 at 23:30
  • `formula(fm1)[-2]` will get you the RHS of the formula more easily. Finally, can you show a (reproducible) example of a case where your prediction data does *not* work in the way you are concerned about? As I said in comment #1 part 4, I'm not convinced yet/don't understand whether there is in fact any difficulty in doing what you want to do with the built-in `predict` function. – Ben Bolker Aug 27 '14 at 23:34
  • @BenBolker -- updated (1) and (2). Re (3), the data is unbalanced. So, there is variation in the random slope terms among titles but not necessarily a large number of cases for each. RE (4), I can't provide the full data (proprietary). But I all extra levels have been cleaned. I'll look into `lme4:::mkNewReTrms()`. – alexwhitworth Aug 28 '14 at 00:03
  • "How can I construct the model matrix (usually denoted as Z) for the random-effects part of the model?" is a reasonable question, but I'm still not convinced that you have presented any examples where you actually need to do that. Ideally you would generate a small/fake reproducible example (e.g. with the `Orthodont` data) showing what you are trying to do (your ultimate goal, i.e. predicting with new and/or missing levels of random and/or fixed effects) – Ben Bolker Aug 28 '14 at 00:23
  • @BenBolker -- additional problems using the `Orthodont` data illustrated. – alexwhitworth Aug 28 '14 at 17:21
  • Alas, we are still failing to communicate. I am less interested in the **proximal** question ("how do I construct an appropriate new Z matrix?") than the **ultimate** question ("how do I predict for new data with missing levels?"). For example, I just tried `predict(fm1,newdata=data.frame(age=6,Sex="Male",Subject="F01"))` which (surprisingly to me) did **not** work (`Error: sum(nb) == q is not TRUE`) -- but I am more interested in fixing `predict.merMod` to work in this case than in developing new methods ... – Ben Bolker Aug 28 '14 at 17:29
  • @BenBolker I am also more interested in the **ultimate** question, as you phrase it. Specifically, I want both predictions and SE's. But seeing as `predict.merMod()` is not currently working; I'm currently after a roundabout method. – alexwhitworth Aug 28 '14 at 17:46
  • https://github.com/lme4/lme4/issues/143 – Ben Bolker Aug 28 '14 at 18:06

1 Answers1

0

Editted 8/12/15: see changes on Github and GitHub Repo

Editted, 10/15/2014: This answer isn't yet perfect. There are still a couple of use-cases with errors (see comment chain below). But it works in most cases. I'll get around to finalizing it at some point.

I believe this function will solve the more important problem, accurate predictions for merMod objects. Dr Bolker, there are still some issues here (such as sparsity and efficiency); but I believe the method works:

data("Orthodont",package="MEMSS")
fm1 <- lmer(formula = distance ~ age*Sex + (1+age|Subject), data = Orthodont)
newdat <- expand.grid(
  age=c(8,10,12,14)
  , Sex=c("Male","Female")
  , distance = 0
  , Subject= c("F01", "F02")
)

predict.merMod2 <- function(object, newdat=NULL) {
# 01. get formula and build model matrix
  # current problem--model matrix is not sparse, as would be ideal
  f1 <- formula(object)[-2]
  z.fe <- model.matrix(terms(object), newdat)
  z.re <- t(lme4:::mkReTrms(findbars(f1), newdat)$Zt)
  mm <- cbind(z.fe, 
              matrix(z.re, nrow= dim(z.re)[1], ncol= dim(z.re)[2],
                     dimnames= dimnames(z.re)))

  # 02. extract random effect coefficients needed for the new data
  # (a) - determine number of coef
  len <- length(ranef(object)) 
  re.grp.len <- vector(mode= "integer", length= len) 
  for (i in 1:len) { # for each random group
    re.grp.len[i] <- dim(ranef(object)[[i]])[2] # number of columns (slope and intercept terms)
  }

  # (b) - create beta vector
  fe.names <- unique(colnames(mm)[1:length(fixef(object)) - 1])
  re.names <- unique(colnames(mm)[-c(1:length(fixef(object)) - 1)]) 
  beta.re <- as.vector(rep(NA, length= sum(re.grp.len) * length(re.names)), mode= "numeric")
  for (i in 1:len) {
    re.beta  <- ranef(object)[[i]][rownames(ranef(object)[[i]]) %in% re.names,] 
    ind.i <- sum(!is.na(beta.re)) + 1; ind.j <- length(as.vector(t(re.beta))) 
    beta.re[ind.i:ind.j]  <- as.vector(t(re.beta)) 
  }
  beta <- c(fixef(object)[names(fixef(object)) %in% fe.names], beta.re)
  # 03. execute prediction
  return(mm %*% beta)
}

predict.merMod2(fm1, newdat)
alexwhitworth
  • 4,839
  • 5
  • 32
  • 59
  • I tested this on 5 of my models for the music data (as described initially). It worked on 4 of 5 (ie- identified an error). I'll update this answer to deal with that as soon as I can get to it, probably tomorrow. – alexwhitworth Aug 29 '14 at 00:40
  • I think I've fixed the problem in lme4 (see https://github.com/lme4/lme4/commit/787987394daf395332debbfd497ba375102f4ecf ). Are you willing to try re-installing from Github (`devtools::install_github("lme4","lme4")`) and trying your test cases? You'd need compilers etc. -- if this is a problem contact me and I'll build a binary for you ... – Ben Bolker Sep 04 '14 at 21:06
  • @BenBolker thanks--ill check into it when I'm back from vacation next week. – alexwhitworth Sep 11 '14 at 13:41
  • @Alex : You've promised @BenBolker to check the development version of `lme4` (available from github)... but may have forgotten about that, after your vacation ? – Martin Mächler Aug 11 '15 at 16:21
  • @MartinMächler The dev version (`1.1.8` at the time) wasn't working for my needs, which is why I forked the lme4/lme4 repo and updated `predict.merMod` per our earlier emails. – alexwhitworth Aug 11 '15 at 16:35
  • @Alex: thanks for the explanation. Continued here [https://github.com/lme4/lme4/issues/303]. My conclusion: It has been a bug in lme4, fixed a long while ago in the "development" version, 1.1-9 not yet on CRAN. – Martin Mächler Aug 12 '15 at 08:44