Avoid failing when a factor has new levels in test set

Question

I have a dataset, which I am splitting into train and test subsets in the following way:

train_ind <- sample(seq_len(nrow(dataset)), size=(2/3)*nrow(dataset))
train <- dataset[train_ind, ]
test <- dataset[-train_ind, ]

Then, I use it to train a glm:

glm.res <- glm(response ~ ., data=dataset, subset=train_ind, family = binomial(link=logit))

And finally, I use it to predict on my test set:

preds <- predict(glm.res, test, type="response")

Depending on the sample, this fails with error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor has new levels

Note that the value appears in the full dataset, but apparently not in the training set. What I want is for the predict function to ignore these new factor levels. Even if it has binarized the factors, I don't see why it can't assume that new values (which are therefore not variables in the linear model) are simply 0; that would yield the correct behaviour.

Is there a way to do this?

It's not fair to say that it's correct behaviour for a model to assume new values should be treated as '0'. E.g. imagine you have a predictor 'eye colour' and your model was trained on data which only contained 'brown' and 'blue'. If this variable was coded with brown = 0 and blue = 1 and your test data now contains someone with green eyes, treating this row as a '0' as per your suggestion would assume they had brown eyes. More generally, it doesn't make sense to ask a model to make predictions about things it's never been exposed to during training, which is why you're getting the error — jruf003, Apr 22 '19 at 06:19
There was the implied assumption in my question (which, admittedly, I should have clarified) that the value 0 was representing a "missing" or "null" value. Also please note that I was not asking how to force a model (i.e. the glm function) to deal with data it has never seen before, but rather how, I, as a programmer could solve an issue in R that can happen in a realistic scenario. — Setzer22, Apr 23 '19 at 07:48
I stand by my original post. I don't think it makes sense to code missing values with zeros in R - they'll either get treated as 'true' zeroes for numeric variables (a bad idea) or as their own level for factors. The possible exception here is if you want to model all missing values as their own category. — jruf003, May 06 '19 at 04:54
@jruf003 I agree with you. My remark was just to highlight that despite that inconsistency the purpose of my question remains valid. I was asking the question with some implicit assumptions on the value 0 that made sense for whatever project I was working on at the time (and which unfortunately I don't remember!). It could be that I really meant "N/A" instead of 0, or that in my particular use-case, the numerical value zero was a good guess for N/As: For example, 0 is the mean value when your data is centered and scaled, which is a more or less sensible treatment (even if suboptimal) for N/As. — Setzer22, May 07 '19 at 08:03

score 1 · Accepted Answer · edited Oct 25 '22 at 13:21

I start with the following data generating process (a binary response variable, one numerical independent variable and 3 categorical independent variables):

set.seed(1)
n <- 500
y <- factor(rbinom(n, size=1, p=0.7))
x1 <- rnorm(n)
x2 <- cut(runif(n), breaks=seq(0,1,0.2))
x3 <- cut(runif(n), breaks=seq(0,1,0.25))
x4 <- cut(runif(n), breaks=seq(0,1,0.1))
df <- data.frame(y, x1, x2, x3, x4)

Here I build the training and testing sets so that some categorical covariates (x2 and x3) have more categories in the testing set than in the training set:

idx <- which(df$x2!="(0.6,0.8]" & df$x3!="(0,0.25]")
train_ind <- sample(idx, size=(2/3)*length(idx))
train <- df[train_ind,]
train$x2 <- droplevels(train$x2)
train$x3 <- droplevels(train$x3)
test <- df[-train_ind,]

table(train$x2)
(0,0.2] (0.2,0.4] (0.4,0.6]   (0.8,1] 
     55        40        53        49 

table(test$x2)
(0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
     58        48        45        90        62 

table(train$x3)
(0.25,0.5] (0.5,0.75]   (0.75,1] 
        66         61         70 

table(test$x3)
(0,0.25] (0.25,0.5] (0.5,0.75]   (0.75,1] 
     131         63         47         62

Of course, predict yields the error message described above by @Setzer22:

glm.res <- glm(y ~ ., data=train, family = binomial(link=logit)) 
preds <- predict(glm.res, test, type="response")

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor x2 has new levels (0.6,0.8]

Here is a (not elegant) way to delete rows of test which have new levels in the covariates:

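# For column k of test, return TRUE for rows whose value was seen in train
dropcats <- function(k) {
   xtst <- test[,k]
   xtrn <- train[,k]
   # Which distinct test values also occur in the training set?
   cmp.tst.trn <- (unique(xtst) %in% unique(xtrn))
   if (is.factor(xtst) && any(!cmp.tst.trn)) {
      cat.tst <- unique(xtst)
      # Compare each row with every level seen in training; keep rows matching any
      apply(test[,k]==matrix(rep(cat.tst[cmp.tst.trn],each=nrow(test)),
                      nrow=nrow(test)),1,any)
   } else {
      # Numeric column, or factor without new levels: keep every row
      rep(TRUE,nrow(test))
   }
}
# Keep only rows that pass the check for every covariate (columns 2 onward)
filt <- apply(sapply(2:ncol(df),dropcats),1,all)
subset.test <- test[filt,]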
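
The same row filter can probably be written more compactly by comparing the test data against the levels stored in the fitted model (the xlevels component of the glm object) instead of against the training data frame. This is only a sketch on hypothetical toy data; the names toy, toy_train, toy_test, fit and seen are made up for the illustration and are not part of the example above.

```r
# Hypothetical toy data: factor g has level "c" only in the test set
set.seed(42)
toy <- data.frame(
  y = rbinom(90, size = 1, prob = 0.6),
  x = rnorm(90),
  g = factor(rep(c("a", "b", "c"), times = 30))
)
toy_train <- droplevels(toy[toy$g != "c", ])
toy_test  <- toy

fit <- glm(y ~ ., data = toy_train, family = binomial(link = "logit"))

# fit$xlevels holds, for each factor in the model, the levels seen when fitting.
# A row is kept only if all of its factor values are among those levels.
seen <- Reduce(`&`, lapply(names(fit$xlevels), function(v) {
  as.character(toy_test[[v]]) %in% fit$xlevels[[v]]
}), rep(TRUE, nrow(toy_test)))

toy_sub <- toy_test[seen, ]
preds   <- predict(fit, toy_sub, type = "response")
```

As with dropcats, rows with unseen levels are simply removed, so preds has fewer elements than the test set has rows.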

In subset.test, x2 and x3 no longer contain any rows with new categories (the unseen levels remain only as empty levels with count 0):

table(subset.test[,"x2"])
  (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
       26        25        20         0        28

table(subset.test[,"x3"])
  (0,0.25] (0.25,0.5] (0.5,0.75]   (0.75,1] 
         0         29         29         41

Now predict works nicely:

preds <- predict(glm.res, subset(test,filt), type="response")
head(preds)

       30        39        41        49        55        56 
0.7732564 0.8361226 0.7576259 0.5589563 0.8965357 0.8058025
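
If you would rather keep every test row and get NA for the rows the model cannot handle, one alternative (not what I did above) is to recode each factor in the test set to the levels stored in the model. Values outside those levels become NA, and predict then returns NA for those rows instead of stopping. Again a sketch on hypothetical toy data; toy, toy_train, toy_test_na and fit are illustrative names only.

```r
# Hypothetical toy data: factor g has level "c" only in the test set
set.seed(42)
toy <- data.frame(
  y = rbinom(90, size = 1, prob = 0.6),
  x = rnorm(90),
  g = factor(rep(c("a", "b", "c"), times = 30))
)
toy_train <- droplevels(toy[toy$g != "c", ])

fit <- glm(y ~ ., data = toy_train, family = binomial(link = "logit"))

# Re-create each factor with exactly the levels known to the model;
# any value outside those levels is turned into NA by factor()
toy_test_na <- toy
for (v in names(fit$xlevels)) {
  toy_test_na[[v]] <- factor(as.character(toy_test_na[[v]]),
                             levels = fit$xlevels[[v]])
}

# predict() passes NAs through by default, so rows with an unseen level yield NA
preds_na <- predict(fit, toy_test_na, type = "response")
```

The result has one entry per test row, which makes it easy to line predictions up with the original data and to decide separately what to do with the NA rows.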

Hope this can help you.

Thanks for your answer! I find it discouraging that there seems to be no easy way to do this in R. It seems to me like a basic edge-case the implementation should be covering, and the solution is straightforward. Is there something I didn't take into account? Why cannot it just ignore any new values? — Setzer22, May 15 '17 at 07:26
