
I am trying to run knnreg from the package caret. For some reason, this training set works:

> summary(train1)
       V1                V2             V3             
 13     : 10474   1      :  6435   7      :  8929     
 10     : 10315   2      :  6435   6      :  8895     
 4      : 10272   3      :  6435   9      :  8892     
 1      : 10244   4      :  6435   10     :  8892     
 2      : 10238   7      :  6435   15     :  8874     
 24     : 10228   8      :  6435   40     :  8870                        
 (Other):359799   (Other):382960   (Other):368218   

While this one won't work:

> summary(train2)
        V1              V2               V3                   V4      
 13     : 10474   1      :  6436   7      :  8929   Christmas   :  5946  
 10     : 10315   2      :  6436   6      :  8895   Labor Day   :  8861  
 4      : 10272   3      :  6438   9      :  8892   None        :391909  
 1      : 10244   4      :  6435   10     :  8892   Super Bowl  :  8895  
 2      : 10238   7      :  6435   15     :  8874   Thanksgiving:  5959  
 24     : 10228   8      :  6435   40     :  8870                        
 (Other):359799   (Other):382960   (Other):368218   

Here is the target vector:

> summary(Target)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -499    200    712   1980   20210  693100 

The error I get is during the prediction phase:

> fit <- knnreg(train2, Target, k = 2)
> Prediction <- predict(fit,  newdata=test)
Error in knnregTrain(train = list(V1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,  : 
  NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning messages:
1: In knnregTrain(train = list(V1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,  :
  NAs introduced by coercion
2: In knnregTrain(train = list(V1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,  :
  NAs introduced by coercion

This is my test set:

> summary(test)
     V1            V2           V3                    V4      
 13     : 2836   1      :  1755   51     : 3002   Christmas   :  2988  
 4      : 2803   2      :  1755   49     : 2989   Labor Day   :     0  
 19     : 2799   3      :  1755   52     : 2988   None        :106136  
 2      : 2797   4      :  1755   50     : 2986   Super Bowl  :  2964  
 27     : 2791   7      :  1755   6      : 2984   Thanksgiving:  2976  
 24     : 2790   8      :  1755   47     : 2976                        
 (Other):98248   (Other):104534   (Other):97139     

What am I missing?

EDIT: Switching the V4 labels to '1', '2', ... actually fixes the problem. Does the algorithm treat my features as numeric even though they're factors?

Richie Cotton
thecheech
  • Show us the code using `knnreg`. I imagine that you are not using the formula method and, if that is the case, you are mixing numeric and non-numeric data. `knnreg` will convert it to a matrix (which has to be the same type) and it probably ends up being converted to all character data. – topepo Mar 20 '14 at 18:43
  • @topepo you are right. I am using the data.frame class. Here is the code: `> fit <- knnreg(train2, Target, k = 2)` – thecheech Mar 20 '14 at 20:30
  • Run the data through `model.matrix` to convert the factors to dummy variables (and don't forget to center and scale the data too). It should work then. – topepo Mar 21 '14 at 00:38
  • @topepo Thanks. When I convert the data through `model.matrix` I end up with a very sparse matrix. By default the matrix is of type double; is there a way to have it created directly as logical or factor (otherwise I get a memory allocation error)? – thecheech Mar 21 '14 at 07:31
  • Also, is there no way to run a knn regression with categorical variables without converting them into dummy variables? Is it the same with all types of regression? – thecheech Mar 21 '14 at 07:39
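
The `model.matrix` suggestion from the comments can be sketched roughly as follows. This is a minimal illustration in base R on made-up toy data (the `train` data frame and its values are hypothetical, not the asker's data), and the commented-out `knnreg` call simply mirrors the one in the question.

```r
# Toy stand-in for train2 (hypothetical values, not the asker's data)
train <- data.frame(
  V1 = factor(c("1", "2", "1", "3")),
  V4 = factor(c("None", "Christmas", "None", "Super Bowl"))
)

# model.matrix() expands each factor into 0/1 dummy columns;
# "- 1" in the formula drops the intercept column
X <- model.matrix(~ . - 1, data = train)

# Center and scale every column, as suggested in the comments
# (a column with zero variance would produce NaN here)
X_scaled <- scale(X)

# The resulting numeric matrix could then be passed to knnreg, e.g.:
# fit <- caret::knnreg(X_scaled, Target, k = 2)
```

The test set would need the same treatment, with the same factor levels as the training set, so that the dummy columns line up between the two matrices.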

1 Answer

I realized that knnreg accepts only numeric values. When I trained the model with train1, it treated all values as numeric (when in fact they are categorical), because the factor labels happen to look like numbers. train2 returns an error because the V4 labels are not numeric, and knnreg can't convert them to numbers either.
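
A small sketch of why this happens, assuming (as suggested in the comments) that the data frame is first turned into a character matrix and then coerced to numbers. The miniature `train1` and `train2` below are hypothetical, and the coercion steps are an illustration consistent with the "NAs introduced by coercion" warning, not a copy of knnreg's internals.

```r
# Hypothetical miniatures of train1 and train2 (not the asker's data)
train1 <- data.frame(V1 = factor(c("13", "10", "4")))
train2 <- data.frame(
  V1 = factor(c("13", "10", "4")),
  V4 = factor(c("None", "Christmas", "None"))
)

# A data frame of factors becomes a character matrix of the labels
m1 <- as.matrix(train1)
m2 <- as.matrix(train2)

# Labels that look like numbers parse fine: 13 10 4
n1 <- as.numeric(m1)

# "None" and "Christmas" cannot be parsed and become NA
# (without suppressWarnings this emits "NAs introduced by coercion")
n2 <- suppressWarnings(as.numeric(m2))
```

Note that even when the coercion succeeds, as with train1, the factor labels are treated as magnitudes, so the distance between categories "13" and "10" is taken to be 3.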

thecheech