I'm currently practicing R on Kaggle using the Titanic data set, and I am using the random forest algorithm.

Below is the code

fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age_Bucket + Embarked
                + Age_Bucket + Fare_Bucket + F_Name + Title + FamilySize + FamilyID, 
                data=train, importance=TRUE, ntree=5000)

I am getting the following error

Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In data.matrix(x) : NAs introduced by coercion
4: In data.matrix(x) : NAs introduced by coercion

My data looks like this:

$ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
$ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
$ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1...
$ Age_Bucket : chr  "20-25" "30-40" "25-30" "30-40" ...
$ Fare_Bucket: chr  "<10" "30+" "<10" "30+" ...
$ Title      : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
$ F_Name     : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
$ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
$ FamilyID   : chr  "Small" "Small" "Alone" "Small" ...

If I run just the line below, I get no coercion issues, and as far as I can see it is the only place where coercion could introduce NA values:

as.factor(Survived)

Can anyone see the problem?

Thank you for your time

John Smith

1 Answer

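You need to convert your character columns into factors. Factors are stored internally as integer codes, so they survive the numeric conversion that `randomForest` performs; character columns do not, and are coerced to NA (hence the `data.matrix(x)` warnings in your output). See the following small demonstration: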

Data:

library(randomForest)

df <- data.frame(y = sample(0:1, 26, replace=TRUE), x1=runif(26), x2=letters, stringsAsFactors=FALSE)

df$y <- as.factor(df$y)

> str(df)
'data.frame':   26 obs. of  3 variables:
 $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 1 ...
 $ x1: num  0.457 0.296 0.517 0.478 0.764 ...
 $ x2: chr  "a" "b" "c" "d" ...

Now if I run my randomForest function:

> randomForest(y ~ x1 + x2, data=df)
Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In data.matrix(x) : NAs introduced by coercion

I get the same error you did.

Whereas if I convert the char column into factor:

df$x2 <- as.factor(df$x2)

> randomForest(y ~ x1 + x2, data=df)

Call:
 randomForest(formula = y ~ x1 + x2, data = df) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 61.54%
Confusion matrix:
  0  1 class.error
0 0 16           1
1 0 10           0

It now runs without the error.
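
For a data frame with several character columns, as in the question, the same fix can be applied to all of them in one pass. The following is only a sketch: the `train` frame here is a small invented stand-in with a few of the question's column names, not the real Kaggle data.

```r
# Toy stand-in for the Titanic training frame; the values are invented.
train <- data.frame(
  Survived   = c(0L, 1L, 1L, 0L),
  Age_Bucket = c("20-25", "30-40", "25-30", "30-40"),
  FamilyID   = c("Small", "Small", "Alone", "Small"),
  FamilySize = c(2, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Non-numeric strings cannot be coerced to numbers; this is where the NAs come from.
suppressWarnings(as.numeric(train$Age_Bucket))

# Find every character column and convert it to a factor.
chr_cols <- vapply(train, is.character, logical(1))
train[chr_cols] <- lapply(train[chr_cols], as.factor)

# Factors carry integer codes, so the numeric conversion now succeeds.
as.integer(train$Age_Bucket)
str(train)
```

Numeric and integer columns such as `FamilySize` are left untouched, since only columns for which `is.character()` is true are selected.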

LyzandeR
  • Hi, sorry, I should have been clearer. I ran the line "as.factor(Survived)" on its own and it converted everything into a factor without problems, as that's what I originally thought the problem was. When I run it inside the random forest code, it gives me the coercion error. – John Smith May 10 '15 at 13:52
  • Can you please `dput` the data? – LyzandeR May 10 '15 at 13:58
  • I found the reason why it breaks! You got `+ FamilyID` in your code but this column is not in your dataset. – LyzandeR May 10 '15 at 14:02
  • Hi LyzandeR, I only ran that piece of code to check whether that was where it failed, as it's the only place I can see where NAs could be introduced by coercion. The error only occurs when I run the first segment of code in the OP. It doesn't happen if I run "as.factor(Survived)" on its own. – John Smith May 10 '15 at 14:04
  • I see. I guess you need to `dput` the data, otherwise no one will be able to troubleshoot. I don't have a Kaggle account to get it myself. – LyzandeR May 10 '15 at 14:08
  • Oh! You have `chr` columns in there, and the matrix creation inside the `randomForest` function is failing. Can you please convert those to factors and try again? `Age_Bucket`, for example, is character, and when the matrix is created its values are coerced to NA. – LyzandeR May 10 '15 at 14:10
  • That looks to be it :) Thank you. Sadly, it looks like I have too many factor levels to run it ("Can not handle categorical predictors with more than 53 categories."), so I will try inference trees instead. – John Smith May 10 '15 at 14:16
  • You are welcome :). Glad I could be of help. I updated the answer as well. This is an annoying limitation I know... – LyzandeR May 10 '15 at 14:20
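
To see which predictors would hit the category limit mentioned in the comments, one option is to count the levels of each factor column before fitting. This is a rough sketch on invented data: `train` is a toy frame, `F_Name` stands in for a high-cardinality column such as surnames, and the value 53 is taken from the error message quoted above, so it may differ in other package versions.

```r
# Toy frame; F_Name has 60 distinct values, one per row.
set.seed(1)
train <- data.frame(
  F_Name   = factor(paste0("Name", 1:60)),
  Embarked = factor(sample(c("C", "Q", "S"), 60, replace = TRUE))
)

max_levels <- 53  # assumption: limit quoted in the error message

# Count the levels of every factor column.
is_fac   <- vapply(train, is.factor, logical(1))
n_levels <- vapply(train[is_fac], nlevels, integer(1))

# Names of the predictors that exceed the limit.
too_many <- names(n_levels)[n_levels > max_levels]
too_many
```

Columns flagged this way would have to be dropped from the formula or recoded into fewer categories before the model can be fitted.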