11

I am getting the following error:

c50 code called exit with value 1

I am running this on the Titanic data available from Kaggle.

# Import the dataset
train <- read.csv("train.csv", sep=",")

# Inspect the structure
str(train)

Output:

'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Then I tried using a C5.0 decision tree:

# Trying with C5.0 decision tree
library(C50)

# C5.0 models require a factor outcome; otherwise they raise an error
train$Survived <- factor(train$Survived)

new_model <- C5.0(train[-2],train$Survived)

Running the above lines gives me this error:

c50 code called exit with value 1

I can't figure out what's going wrong. I used similar code on a different dataset and it worked fine. Any ideas about how I can debug my code?

-Thanks

cchamberlain
zephyr

6 Answers

15

For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.

Regarding your problem: first off, I think you meant to write

new_model <- C5.0(train[,-2],train$Survived)

Next, notice the structure of the Cabin and Embarked columns. These two factors have an empty string ("") as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data such that

levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"

your algorithm will now run without an error.
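To find every factor column that carries an empty-string level before fitting, a quick check along these lines may help. This is only a sketch on a made-up data frame (`df` and its columns are hypothetical stand-ins, not the Titanic data):

```r
# Toy data frame standing in for the real data (hypothetical columns)
df <- data.frame(
  cabin    = factor(c("", "C85", "")),
  embarked = factor(c("S", "", "C")),
  sex      = factor(c("male", "female", "female"))
)

# TRUE for each factor column that has "" among its levels
has_blank <- vapply(
  df,
  function(col) is.factor(col) && "" %in% levels(col),
  logical(1)
)

# Names of the columns that would need their blank level renamed
blank_cols <- names(df)[has_blank]
print(blank_cols)
```

Using `%in%` rather than looking only at the first level also catches factors whose levels were set in a non-default order.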

Marco
  • Thanks Marco. It worked!! The missing values in Cabin and Embarked column were causing the issue. The other thing I observed is that train[-2] and train[,-2] have the same output... Is there any other difference between the two ?? – zephyr Apr 02 '14 at 08:08
  • You are right, it seems to work for data.frames. I always use train[,-2], since for matrices train[-2] will transform the result into a vector and just remove one element. This is because conceptually matrices are like vectors and you can access every element of them without specifying row/column – Marco Apr 02 '14 at 08:22
  • Oops. Now the next step is giving similar code exit error. I read the test.csv into test data frame. Then :- new_model_predict <- predict(new_model,test) on the test data. Also I assigned missing labels in Cabin and Embarked columns of test data as well. – zephyr Apr 02 '14 at 08:24
  • I don't have much experience with the C50 library, but is it possible that the factors in the train and test set need to have the same levels? If you don't include the factors that have different levels (Name, Ticket, Cabin, Embarked) it runs fine – Marco Apr 02 '14 at 09:06
  • Thanks for helping out so far. Seems like I need to research more on this. – zephyr Apr 02 '14 at 09:38
8

Just in case: you can inspect the error with

summary(new_model)

This error also occurs when there are special characters in the name of a variable. For example, you will get it if a variable name contains the character "я" (from the Russian alphabet).
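One way to spot such variable names is to flag any that cannot be converted to ASCII. The sketch below uses base R's `iconv` on a hypothetical data frame; it assumes the names are UTF-8 encoded, and it does not establish which characters C50 actually rejects:

```r
# Hypothetical data frame with one non-ASCII column name ("\u044f" is "я")
df <- data.frame(a = 1:2, b = 3:4)
names(df) <- c("age", "\u044f_col")

# iconv() returns NA for strings it cannot represent in the target encoding
non_ascii <- is.na(iconv(names(df), from = "UTF-8", to = "ASCII"))

# Column names worth renaming before calling C5.0
suspect_cols <- names(df)[non_ascii]
print(suspect_cols)
```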

Rustam Guliev
6

Here is what finally worked:

Got this idea after reading this post

library(C50)

test$Survived <- NA

combinedData <- rbind(train,test)

combinedData$Survived <- factor(combinedData$Survived)

# fixing empty character level names 
levels(combinedData$Cabin)[1] = "missing"
levels(combinedData$Embarked)[1] = "missing"

new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]

new_model <- C5.0(new_train[,-2],new_train$Survived)

new_model_predict <- predict(new_model,new_test)

submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)

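The intuition is that combining the two sets before creating the factors gives the train and test data the same factor levels, so the model never sees a level at prediction time that it did not know about at training time.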
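An alternative that avoids rbind is to rebuild each test factor with the training levels. The following is a minimal sketch on toy vectors (`train_col` and `test_col` are hypothetical, not columns from the Kaggle files); note that any test value absent from the training levels becomes NA:

```r
# Toy factors standing in for one column of the train and test sets
train_col <- factor(c("S", "C", "S", "Q"))
test_col  <- factor(c("C", "S", "S"))

# Before alignment the level sets differ ("Q" is missing from test_col)
print(levels(test_col))

# Re-create the test factor using the training levels
test_aligned <- factor(test_col, levels = levels(train_col))
print(levels(test_aligned))
```

With a data frame, the same call would be repeated for each factor column shared by both sets.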

zephyr
3

I had the same error, but I was using a numeric dataset without missing values.

After a long time, I discovered that my dataset had a predictor column called "outcome". C5.0Control uses this name as its default label, and the clash was causing the error :'(

My solution was to rename the column. Another way would be to create a C5.0Control object with a different value for the label argument and pass it as the control parameter of C5.0.
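The second option might look like the sketch below. It assumes the C50 package is installed and uses the built-in iris data with one predictor renamed to "outcome" to mimic the clash; the label "class_label" is an arbitrary choice for illustration, and I have not verified this against every C50 version:

```r
library(C50)

# Mimic the name clash: a predictor column called "outcome"
x <- iris[, 1:4]
names(x)[1] <- "outcome"
y <- iris$Species

# Use a label for the outcome that does not collide with any column name
ctrl <- C5.0Control(label = "class_label")

model <- C5.0(x = x, y = y, control = ctrl)
summary(model)
```

Renaming the column is the simpler fix; changing the label is useful when the column name has to stay as it is.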

Adriano Rivolli
0

I also struggled for some hours with the same problem (return code "1"), both when building a model and when predicting. Following the hint in Marco's answer, I wrote a small function that renames any factor level equal to "" in a data frame or vector; see the code below. Since R passes arguments by value, the function cannot change the original data frame, so you have to use its return value:

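# Rename a leading "" level to "?" in every factor column of a data frame.
# With default (sorted) levels, "" is always the first level.
removeBlankLevelsInDataFrame <- function(dataframe) {
  for (i in seq_len(ncol(dataframe))) {
    lvls <- levels(dataframe[, i])
    # length() is 0 for non-factors (NULL) and for factors without levels
    if (length(lvls) > 0 && lvls[1] == "") {
      levels(dataframe[, i])[1] <- "?"
    }
  }
  dataframe
}

# Same fix for a single factor vector
removeBlankLevelsInVector <- function(vector) {
  lvls <- levels(vector)
  if (length(lvls) > 0 && lvls[1] == "") {
    levels(vector)[1] <- "?"
  }
  vector
}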

Calling the functions may look like this:

trainX = removeBlankLevelsInDataFrame(trainX)
trainY = removeBlankLevelsInVector(trainY)
model = C50::C5.0.default(trainX,trainY)

However, C50 seems to have a similar problem with character columns that contain an empty cell, so you will probably have to extend this to handle character attributes as well, if you have any.
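Such an extension could be sketched as follows. The helper name `replaceBlankStrings` and the "?" placeholder are my own choices for illustration, and whether this is sufficient for C50 is an assumption I have not tested:

```r
# Replace "" cells in every character column of a data frame.
# NA values are left untouched.
replaceBlankStrings <- function(dataframe, replacement = "?") {
  for (i in seq_len(ncol(dataframe))) {
    column <- dataframe[[i]]
    if (is.character(column)) {
      blank <- !is.na(column) & column == ""
      column[blank] <- replacement
      dataframe[[i]] <- column
    }
  }
  dataframe
}

# Toy example with one character column and one numeric column
d <- data.frame(
  cabin = c("", "C85", NA),
  fare  = c(7.25, 71.28, 8.05),
  stringsAsFactors = FALSE
)
cleaned <- replaceBlankStrings(d)
print(cleaned)
```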

0

I also got the same error, but it was because of some illegal characters in the factor levels of one of the columns.

I used the make.names function to correct the factor levels:

levels(FooData$BarColumn) <- make.names(levels(FooData$BarColumn))

Then the problem was resolved.
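To illustrate what make.names does to awkward levels, here is a small sketch on a made-up factor (`f` is hypothetical). Passing unique = TRUE is my addition; it guards against two different levels being mapped to the same cleaned name and silently merged:

```r
# Toy factor with a space, a leading digit, and an empty level
f <- factor(c("first class", "2nd", "", "ok"))
print(levels(f))

# Turn every level into a syntactically valid, unique name
levels(f) <- make.names(levels(f), unique = TRUE)
print(levels(f))
```

The empty level also gets a non-empty name this way, so the same call covers the blank-level case discussed in the other answers.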

Hamed2005