CVlm with categorical variables: factor has new levels

Question

I am using lm for multiple linear regression (MLR) and CVlm for cross-validation. My data contains two categorical variables (one of them with 11 levels and the other with only 2). Everything works fine with lm; the problem appears when I try to use CVlm, where I get errors because of the factor levels. I have read some posts about this, but I don't understand them very well (for CVlm I am using the same data as for lm, so I don't know why this error occurs or how to handle it). Here is a sample of my data:

      dput(head(data))
      structure(list(LagO3 = c(35.0092884462795, 37.7681232441784, 
      31.9993881550014, 32.5950690475087, 37.2233826323784, 42.531864470374
      ), Z = c(165.252173124639, 166.145467346544, 161.857655081398, 
      177.043656853793, 200.269306623339, 207.772978087346), RH = c(86.4605102539062, 
      93.2499008178711, 87.1677398681641, 81.0183639526367, 74.1963653564453, 
      78.7728729248047), SR = c(310.165555555556, 343.304444444444, 
      329.844444444444, 299.145555555556, 319.321111111111, 327.731111111111
      ), ST = c(320.032313368056, 286.879364149306, 295.939059244792, 
      319.065705295139, 316.955619574653, 297.229990234375), TC = c(0.0362091064453125, 
      0.171852111816406, 0.607879638671875, 0.770919799804688, 0.553321838378906, 
      0.04547119140625), Tmx = c(289.281782049361, 289.283827735997, 
      289.913899219804, 288.649664878918, 289.756381348852, 290.302579680594
      ), Wd = c(11.0027627927081, 2.83403791472211, 3.69153840122015, 
      6.65367358341413, 4.17920155713043, 5.35254406830185), CWT = structure(c(1L, 
      9L, 5L, 4L, 4L, 4L), .Label = c("A", "C", "E", "N", "NE", "NW", 
      "S", "SW", "U", "W"), class = "factor"), LW = structure(c(1L, 
      2L, 2L, 2L, 2L, 1L), .Label = c("0", "LW"), class = "factor"), 
      o3 = c(37.7681232441784, 31.9993881550014, 32.5950690475087, 
      37.2233826323784, 42.531864470374, 48.3496367346306)), .Names = c("LagO3", 
      "Z", "RH", "SR", "ST", "TC", "Tmx", "Wd", "CWT", "LW", "o3"), row.names = c(NA, 
      6L), class = "data.frame")

This would be my model:

   model<-  lm(formula = o3 ~ LagO3 + Z + RH + ST + TC + Tmx + Wd + CWT, 
       data = data, na.action = na.exclude)

When I try to do CV:

      cvlm.mod <- CVlm(na.omit(data),model,m=10)

I have the error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor CWT has new levels S

The data$CWT has the levels: levels(data$CWT) [1] "A" "C" "E" "N" "NE" "NW" "S" "SW" "U" "W"

I figured out that the error probably happens because data$CWT=="S" occurs only once (among the 920 observations in the data). After adding one more "S" value to data$CWT, CVlm works fine, which supports that guess. But I am still stuck: I don't know how to handle this kind of case.

Thanks again!!!

In your data there are levels that are not in the "CWT". Try after dropping the levels `droplevels(data)` — akrun, Feb 02 '15 at 09:07
I tried it..but it is not working..actually I am not very sure about your comment " In your data there are levels that are not in the "CWT""? data$CWT contains (now) 10 levels ..it is why I am confused about your comment (and the error)...anyway,thanks a lot for the suggestion! — user3231352, Feb 02 '15 at 09:17
I meant in the dput output, `unique(data$CWT) #[1] A U NE N #Levels: A C E N NE NW S SW U W` May be your original data has all the levels. It may be better to post an example that mimics the original one. — akrun, Feb 02 '15 at 09:18
I would like that ...but I have too many data :( ..it is why I only put the head ...sorry :( — user3231352, Feb 02 '15 at 09:21

LyzandeR · Accepted Answer · 2015-02-02T21:19:56.930

This is the typical problem of factor levels differing between the folds in cross-validation. When a level (here "S") appears only in the held-out fold, the model fitted on the training folds has no coefficient for it, so prediction on the test fold fails with the "new levels" error. The solution is to create the dummy variables yourself and then use the CVlm function:

Solution

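df <- na.omit(data)                            #the question's data frame is `data`; drop incomplete rows so the steps below line up
dummy_LW <- model.matrix(~LW, data=df)[,-1]    #dummy for LW (a single 0/1 vector)
dummy_CWT <- model.matrix(~CWT, data=df)[,-1]  #dummies for CWT (columns CWTC, CWTE, ...)
df <- Filter(is.numeric,df)                    #exclude LW and CWT from original dataset
df <- cbind(df,dummy_LW,dummy_CWT)             #add the dummies instead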

Then run the model as you did (make sure you add the new variable names):

model<-  lm(formula = o3 ~ LagO3 + Z + RH + ST + TC + Tmx + dummy_LW + 
                           CWTC + CWTE + CWTN + CWTNE + CWTNW + CWTS + 
                           CWTSW + CWTU + CWTW, 
            data = df, na.action = na.exclude)
cvlm.mod <- CVlm(na.omit(df),model,m=10)   #use df (with the dummy columns), not the original data

Unfortunately, I cannot test the above because your sample data has too few rows (6 rows are not enough), but the approach should work.
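Since the posted sample is too small to run, here is a minimal base-R sketch of the same idea on made-up data. The data frame `toy`, its columns, and the single held-out row are assumptions for illustration only, and the split is done by hand rather than with CVlm. It first reproduces the "new levels" error with a factor predictor, then shows that prediction goes through once the dummy columns are built on the full data.

```r
# Toy data (hypothetical, not the asker's): level "S" occurs in exactly one row.
set.seed(1)
toy <- data.frame(
  x   = rnorm(9),
  CWT = factor(c("A", "A", "A", "A", "N", "N", "N", "N", "S"))
)
toy$y <- 2 * toy$x + rnorm(9)

# 1) Factor predictor: the training fold (rows 1-8) never sees "S",
#    so predict() on the held-out row stops with an error.
fit_factor <- lm(y ~ x + CWT, data = toy[-9, ])
err <- tryCatch(
  predict(fit_factor, newdata = toy[9, , drop = FALSE]),
  error = function(e) conditionMessage(e)
)
print(err)

# 2) Dummy columns built once on the full data: every fold has the same columns.
dummies <- model.matrix(~ CWT, data = toy)[, -1, drop = FALSE]  # CWTN, CWTS
toy_d   <- cbind(toy[c("y", "x")], dummies)
fit_dummy <- lm(y ~ x + CWTN + CWTS, data = toy_d[-9, ])

# CWTS is all zeros in the training fold, so its coefficient is NA and R may
# warn about a rank-deficient fit; the prediction itself is still returned.
pred <- suppressWarnings(predict(fit_dummy, newdata = toy_d[9, , drop = FALSE]))
print(pred)
```

Note that the prediction for the held-out "S" row ignores any "S" effect, because that effect cannot be estimated from a fold that contains no "S" rows. The dummy approach removes the error, but a level that occurs only once is still not really being cross-validated.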

A few words about model.matrix:

It creates dummy variables for categorical data. By default it leaves one level out as the reference level (as it should), because otherwise the dummies would be perfectly collinear with the intercept. The [,-1] in the above code just removes the intercept, which is an unneeded column of 1s.
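A small sketch of what model.matrix returns may make this concrete. The data frame `f` is made up for illustration; the column naming (variable name followed by level name) is the default behaviour with treatment contrasts.

```r
# Hypothetical two-column data frame with one 2-level and one 3-level factor.
f <- data.frame(
  LW  = factor(c("0", "LW", "LW", "0")),
  CWT = factor(c("A", "N", "S", "A"))
)

mm_LW  <- model.matrix(~ LW,  data = f)
mm_CWT <- model.matrix(~ CWT, data = f)

print(colnames(mm_LW))    # intercept plus one dummy for the non-reference level
print(colnames(mm_CWT))   # intercept plus one dummy per non-reference level

# Dropping the first column removes the intercept.
# For a 2-level factor the result collapses to a plain vector, which is why
# cbind() names that column after the R variable (dummy_LW in the answer).
lw_vec  <- mm_LW[, -1]
cwt_mat <- mm_CWT[, -1]
print(is.null(dim(lw_vec)))   # TRUE: a vector, not a matrix
print(colnames(cwt_mat))
```

This is why the formula in the answer refers to `dummy_LW` for the two-level factor but to `CWTC`, `CWTE`, and so on for the multi-level one. Using `[, -1, drop = FALSE]` instead would keep the original column name (`LWLW` here).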

edited Feb 02 '15 at 21:19

answered Feb 02 '15 at 20:27

LyzandeR


Hi!! thanks a lot!! it seems that it is working :), nevertheless, I have a question (sorry if it is too basic, but I am starting with that). So far, I was using CWT and LW as a factor, since they are categorical....then if I make the change into a dummy (as in the example), I would expect similar results, am I right?? I mean, it is also correct to use factor for categorical variables (instead creating "dummies") , I see the advantage of creating dummies when using CVgam or CVlm....Many thanks again for help! – user3231352 Feb 03 '15 at 13:21
Even if you don't create the dummies explicitly like above, the function `lm` will create them internally using `model.matrix`. So, it is exactly the same. I always find it better to calculate them on my own because I feel I am in control. The results will be exactly the same whether you use the above or just use the factor in an `lm` (provided the reference level, i.e. the one left out, is the same). – LyzandeR Feb 03 '15 at 13:42
I hope this makes sense now and happy to have helped :) – LyzandeR Feb 03 '15 at 13:48
Regarding the reference level, I have one more question (sorry, I am trying to understand it better how it works )..does it mean that I would lose information about one level, the reference one?? (i.e I saw that the summary model never includes the first level -for instance, type "A", then I was wondering if I am losing this information when I am fitting the model, am I wrong??does it make sense? Thanks again for the help! – user3231352 Feb 03 '15 at 17:18
You are not losing any information at all. The reference value is excluded because essentially it is being represented by all the other levels together. This means that wherever all the other levels are zero the reference level would be one. The algorithm knows that. Effectively, if you included the reference level it would be like having the same value twice, as the reference is correlated 100% with the combination of the rest. Try including that level in as a test and you will see that you will get NAs for one of the level estimates. – LyzandeR Feb 03 '15 at 17:25
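The two points made in these comments can be checked with a short sketch on made-up data (the data frame `d` and its columns are assumptions for illustration): including a dummy for every level alongside the intercept leaves one coefficient as NA, and the fitted values are the same as when the factor is used directly.

```r
# Hypothetical data: a 3-level factor and a random response.
set.seed(42)
d <- data.frame(g = factor(rep(c("A", "B", "C"), each = 4)))
d$y <- rnorm(12)

# One dummy column per level (no reference level dropped): gA, gB, gC.
full <- model.matrix(~ g - 1, data = d)
d2   <- cbind(d["y"], full)

# Intercept + all three dummies is redundant, so lm() cannot estimate
# one of the coefficients and reports it as NA.
fit_all <- lm(y ~ gA + gB + gC, data = d2)
print(coef(fit_all))

# Using the factor directly gives the same fitted values:
# the reference level is absorbed into the intercept, not lost.
fit_factor <- lm(y ~ g, data = d)
print(all.equal(unname(fitted(fit_all)), unname(fitted(fit_factor))))
```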
