Why does predict, on a subset of the original data, fail using gls when lm does the trick?

Question

The problem is illustrated using the code below. If you run it you will see that lm handles the predict gracefully while gls fails to do so. This is most likely a problem in predict.gls but I don't understand why. This is only an issue when using the factor call. Without it everything works out fine. I am fairly confident that predict.gls fails because all levels are not present in the new dataset. However, lm works it out. To me it feels like a bug but I'm not proficient enough with the gls code to determine it.

library(nlme)

# lm example
myfit<-lm(mpg~factor(cyl):disp+hp, data=mtcars)
mypred<-predict(myfit, mtcars[1:3, 1:7])

# gls example
myfit2<-gls(mpg~factor(cyl):disp+hp, data=mtcars)
mypred2<-predict(myfit2, mtcars[1:3, 1:7])

It fails with the error:

# Error in X[, names(cf), drop = FALSE] : subscript out of bounds

Any ideas?

My R.version output:

platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 0.2
year 2013
month 09
day 25
svn rev 63987
language R
version.string R version 3.0.2 (2013-09-25) nickname Frisbee Sailing

nlme package version: "package ‘nlme’ version 3.1-113"

I suppose that one could argue that `predict.gls` should use the `xlev` argument in `model.frame`. But the authors might argue that you really should be converting the column to factor up front in your data. — joran, Mar 06 '14 at 16:18
I believe the error is caused by cyl==8 missing from newdata. `mypred2<-predict(myfit2, mtcars[1:5, 1:7])` works. This might be considered a bug in how `predict.gls` handles factors. — Roland, Mar 06 '14 at 16:19
@joran You get the same error after `mtcars$cyl <- factor(mtcars$cyl); myfit2<-gls(mpg~cyl:disp+hp, data=mtcars)`. — Roland, Mar 06 '14 at 16:21
@Roland If thats the case, then it does seem like an oversight in their use of model.frame — joran, Mar 06 '14 at 16:26
@RScriv I'm pretty sure that, while you are right in general, that is not the source of this particular error. — joran, Mar 06 '14 at 17:20
In the code I can see the drop unused levels flag being set to true in the gls case just like in the lm case so that to me suggests that the "missing" values in a factor should pose no problem. Or am I misinterpreting it? — Dr. Mike, Mar 07 '14 at 07:47
@RScriv You shouldn't look at the usefulness or quality of the model provided. It's just a dummy example to illustrate the issue. :) — Dr. Mike, Mar 07 '14 at 07:50
I've encountered this error as well (very much late to the party, but the bug remains unfixed). A quick and dirty workaround is just: rbind a data frame with all possible factor values to the data you're passing as newdata to the predict function, and then remove it again from the result. — Erik A, Aug 12 '19 at 16:01

score 2 · Accepted Answer · answered Mar 07 '14 at 17:12

Since I'm not the author of predict.gls I can't answer precisely why it was written this way, but I'm hesitant to suggest that this is a bug in a function/package that have been around this long.

Anyway...the reason it works with lm is that when predict.lm calls model.frame:

m <- model.frame(Terms, newdata, na.action = na.action, 
            xlev = object$xlevels)

it is using the xlev argument and the information on the levels from the fitted model object itself. I don't believe that gls object stores this information at all.

In predict.gls there are two things going on. First, the initial call to model.frame:

mfArgs <- list(formula = form, data = newdata, na.action = na.action)
mfArgs$drop.unused.levels <- TRUE
dataMod <- do.call("model.frame", mfArgs)

Note that here we're not using the xlev argument, and in fact we're explicitly saying to drop unused levels. If you create your own version of predict.gls and comment our the drop.unused.levels line, it should work, as long as you don't call factor in your formula!

Why?

Because later on we see this:

X <- model.matrix(form, dataMod)

where your formula is being re-applied. Which means that factor is being called on columns with levels that don't exist, forcing them to be dropped.

So when I use a version of predict.gls that comments out the drop.unused.levels line, and I set cyl to be a factor up front in the data frame, it generates predictions for me just fine.

But I would suggest asking the package authors as to whether this is intended behavior or not. It seems odd to me, but like I said, for a package this old and well used it seems strange that something like this would be un-intended.

I think this is a bug. I reported something similar as a bug today: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17228 — Bill Denney, Feb 22 '17 at 18:44

Why does predict, on a subset of the original data, fail using gls when lm does the trick?

1 Answers1