I get an error when fitting a C5.0 model to the UCI Mushroom data set. The target class is a factor and there are no missing values.

f <-file("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", open="r")
data <- read.table(f, sep=",", header=F)
str(data)

gives

'data.frame':   8124 obs. of  23 variables:
$ V1 : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
$ V2 : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ V3 : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ V4 : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ V5 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
$ V6 : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ V7 : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ V8 : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ V9 : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ V10: Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ V11: Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ V12: Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
$ V13: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V14: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V15: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ V16: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ V17: Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
$ V18: Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V19: Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
$ V20: Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
$ V21: Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
$ V22: Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
$ V23: Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...

and when I run

C5.model <- C5.0(data[1:4000,-1],data[1:4000,1],trials = 3)

gives

c50 code called exit with value 1

I have no idea how to track this down. Any suggestions for debugging are appreciated.

Edit 1: The error is the same, but the solution is different. Note: when I switch to a different data set, the call works.

krishna
  • That data set has missing values, so that's the problem there. But this data set doesn't have any missing values. – krishna Jul 23 '16 at 06:36
  • Your data is degenerate. For example, variables V7 & V17 only take one value. – tchakravarty Jul 23 '16 at 06:36
  • @tchakravarty This is correct, though V7 is actually OK if he just includes a few more rows, as it has 2 levels. – Hack-R Jul 23 '16 at 07:00
  • @tchakravarty Thanks guys. That worked. – krishna Jul 23 '16 at 07:02
  • @tchakravarty: What's wrong with V7? Aren't 2 levels enough to partition the data? – krishna Jul 23 '16 at 07:06
  • @Hack-R: What does that mean? – krishna Jul 23 '16 at 07:09
  • @krishna See my answer below (and mark as the solution if it is helpful). V7 is fine. You only included 4000 rows in the example and in those 4000 rows V7 only took on 1 value, therefore it would not work. However when you include more rows then V7 takes on 2 values and it is fine. V17 never works though because it always has only 1 value. – Hack-R Jul 23 '16 at 07:12
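
The point about V7 can be reproduced without the mushroom data. The following is a minimal sketch on a made-up data frame (the names `toy`, `v7`, and `v17` are illustrative assumptions, not the real columns): a factor keeps all of its declared levels after row subsetting, so `str()` can report 2 levels even when only one value is actually present in the training rows.

```r
# Toy stand-in for the mushroom data: v7 has two levels overall,
# but only one of them occurs in the first four rows.
toy <- data.frame(
  v7  = factor(c("f", "f", "f", "f", "a", "a")),
  v17 = factor(c("p", "p", "p", "p", "p", "p"))
)

train <- toy[1:4, ]

# nlevels() counts the levels stored on the factor, not the values present.
nlevels(train$v7)              # 2
# droplevels() discards unused levels, revealing the degenerate column.
nlevels(droplevels(train$v7))  # 1
# table() shows the same thing as counts per level.
table(train$v7)
```

Running `table()` or `droplevels()` on the exact rows passed to the model is therefore a more reliable check than reading the level counts from `str()` on the full data.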

1 Answer

f <-file("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", open="r")
data <- read.table(f, sep=",", header=F)
str(data)

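library(C50)
# The data has 8124 rows, so index all of them rather than 1:10000,
# and skip column 17 (V17), which has a single level.
C5.model <- C5.0(data[, c(2:16, 18:23)], data[, 1], trials = 3, na.action = na.pass)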

Column 17 (V17) caused the problem: it has a single level ("p") in every row, so it has no variation to split on.
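
Rather than hard-coding column numbers, constant columns can be filtered out programmatically. This is only a sketch: `drop_constant_cols` is a hypothetical helper written for this example (it is not part of C50), and it is demonstrated on made-up data.

```r
# Hypothetical helper: keep only columns with at least two distinct
# non-missing values in the rows that are passed in.
drop_constant_cols <- function(df) {
  n_distinct <- vapply(
    df,
    function(col) length(unique(col[!is.na(col)])),
    integer(1)
  )
  df[, n_distinct >= 2, drop = FALSE]
}

toy <- data.frame(
  a = factor(c("x", "y", "x", "y")),
  b = factor(c("p", "p", "p", "p")),  # constant, like V17
  c = factor(c("m", "m", "n", "n"))
)

predictors <- drop_constant_cols(toy)
names(predictors)  # "a" "c"
```

Applied to the question's call, this would look like `C5.0(drop_constant_cols(data[1:4000, -1]), data[1:4000, 1], trials = 3)`. Because the check runs on the training rows themselves, it should also catch columns such as V7 that are constant only within the subset.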

Hack-R