I am having trouble training a C5.0 model on my dataset. Before posting, I researched the similar issues and solutions other people have reported. My dataset has none of the problems described there, yet the C50 call in R still fails. My dataset looks like this:

'data.frame':   113967 obs. of  15 variables:
$ region                : Factor w/ 51 levels "US:AK","US:AL",..: 2 3 3 4 4 4 4 5 5 5 ...
$ city                  : Factor w/ 6396 levels "179708","179720",..: 24 156 156 194 214 226 244 276 316 407 ...
$ dma                   : Factor w/ 211 levels "1","500","501",..: 24 148 148 173 173 173 189 195 204 208 ...
$ user_day              : Factor w/ 7 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
$ user_hour             : Factor w/ 24 levels "0","1","10","11",..: 5 16 16 4 22 7 10 11 15 21 ...
$ os_extended           : Factor w/ 71 levels "0","100","113",..: 55 68 68 7 29 14 14 14 29 34 ...
$ browser               : Factor w/ 19 levels "0","10","11",..: 19 18 18 8 18 9 18 17 18 18 ...
$ domain                : Factor w/ 2685 levels "0calc.com","100daysofrealfood.com",..: 1709 777 777 1406 727 2658 1406 1604 964 2658 ...
$ position              : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 2 1 1 1 2 ...
$ placement             : Factor w/ 5406 levels "10004098","10008956",..: 3331 1696 1714 3600 438 479 3598 3423 5406 479 ...
$ publisher             : Factor w/ 1641 levels "1000773","1000776",..: 581 687 687 663 1369 1525 663 624 1641 1525 ...
$ seller_member_id      : Factor w/ 304 levels "1001","1019",..: 19 101 101 40 19 35 40 40 75 35 ...
$ user_group            : Factor w/ 1000 levels "0","1","10","100",..: 252 243 243 363 343 342 162 380 122 212 ...
$ size                  : Factor w/ 7 levels "160x600","300x250",..: 5 2 2 4 5 2 2 1 2 2 ...
$ predict.bid.vector.bin: Factor w/ 2 levels "(0.112,0.831]",..: 1 1 1 1 1 1 1 2 1 2 ...

As you can see, the last variable is my target variable (a factor), and every feature has more than one level. Moreover, there are no NAs in the dataset. Yet when I run C5.0, I get this error:

> library(C50)
> myC50_Tree <- C5.0(x = test_set[,-15], y = test_set$predict.bid.vector.bin)

c50 code called exit with value 1

> summary(myC50_Tree)

Call:
C5.0.default(x = test_set[, -15], y = test_set$predict.bid.vector.bin)


C5.0 [Release 2.07 GPL Edition]     Fri Apr 13 14:29:54 2018
-------------------------------

*** line 6 of `undefined.names': attribute `region' has only one value `US'

Error limit exceeded

What would be the issue here?

You can generate a simulated version of my dataset with the following R code:

# --- Set unique feature values

region <- c("US:AL","US:AR","US:AZ","US:CA","US:CO","US:CT","US:DC","US:FL")
city <- c("179944","180802","181120","181212","181251","181315","181400","181512","181762","181842","181934","181953","182259","182295")
dma <- c('522','693','754','875','345','234')
user_day <- c('1','2','3','4','5','6')
user_hour <- c('12','11','10','9','8','7','6','5')
os_extended <- c('187','92','125','87','90')
browser <- c('8','9','18','5')
domain <- c('yahoo.com','youtube.com','mmctw.com','msn.com','frive.com','wework.com')
position <- c('0','1','2','3')
placement <- c('`234123412','34563451','235234624','46785467','234556834','85991927394')
publisher <- c('5345','57867','78034','123452','84567','245645','956752')
seller_memeber_id <- c('234','745','546','687','235')
user_group <- c('112','556','009','345','238')
size <- c('100X20','340X10','300X500','300X600')
predict.bid.vector.bin <- c('(0.831,1.55]',  '(0.112,0.831]')
features <- list(region,city,dma,user_day,user_hour,os_extended,browser,domain,position,placement,publisher,seller_memeber_id,user_group,size,predict.bid.vector.bin)

# --- Sample simulated dataset

test_set <- vector()

for (feature in 1:length(features)) {
  test_set <- cbind(test_set, sample(features[[feature]],1000,replace=TRUE))
}
test_set <- data.frame(test_set)
colnames(test_set) <- c('region','city','dma','user_day','user_hour',
                      'os_extended','browser','domain','position',
                      'placement','publisher','seller_memeber_id',
                      'user_group','size','predict.bid.vector.bin')
# --- check data

str(test_set)
Mark Li
  • Possible duplicate of [C5.0 decision tree - c50 code called exit with value 1](https://stackoverflow.com/questions/22803310/c5-0-decision-tree-c50-code-called-exit-with-value-1) – mysteRious Apr 13 '18 at 18:38
  • @mysteRious I have checked that post. As I said, my dataset doesn't have the same issue, and their solution doesn't apply to my issue here. I don't think it's a duplicate. – Mark Li Apr 13 '18 at 18:40
  • In that case you should provide us with a sample of your data so the problem is reproducible. Right now it's not possible to explore the error. Thanks :) – mysteRious Apr 13 '18 at 18:44
  • @mysteRious Sure, I have attached r code to generate simulated dataset. Thanks! – Mark Li Apr 13 '18 at 19:19

1 Answer

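The problem is the values of the region variable: I think C5.0 doesn't like the colons in them. The error message points the same way, since it reports that region has only one value, "US", which is what every level looks like if everything after the colon is cut off. I recreated your dataset with: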

region <- c("AL","AR","AZ","CA","CO","CT","DC","FL")

And then it worked with no errors:

treeModel <- C5.0(x=test_set[,-15],y=test_set[,15])
treeModel

...
Evaluation on training data (1000 cases):

    Decision Tree   
  ----------------  
  Size      Errors  

   103  220(22.0%)   <<


   (a)   (b)    <-classified as
  ----  ----
   358   122    (a): class 1
    98   422    (b): class 2


Attribute usage:

100.00% user_hour
 28.30% region
 27.30% dma
 24.30% city
 17.60% user_day
 15.40% size
 12.70% placement
  9.10% user_group
  7.90% browser
  6.50% os_extended
  4.70% publisher
  4.40% position
  3.70% domain
  3.00% seller_memeber_id

I also recoded the dependent variable as 1 and 2, just in case the strings with the ranges were causing a problem, but that didn't seem to matter at all. (That recoding is why the output above shows predictions for class 1 and class 2.)
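If recreating the data is not an option, the same fix can probably be applied to the existing data frame by renaming the factor levels. The following is a minimal sketch in base R, not a tested fix for the full 113967-row dataset: the helper name strip_colons, the "_" replacement, and the tiny stand-in data frame are all made up for illustration, and it assumes the columns are factors as in the str() output of the question.

```r
# Hypothetical stand-in for the real data: a factor column whose levels contain colons.
test_set <- data.frame(
  region = factor(c("US:AL", "US:AR", "US:AZ", "US:AL")),
  size   = factor(c("160x600", "300x250", "300x250", "160x600"))
)

# Replace ":" in the level labels of every factor column.
# Assigning to levels() only renames the labels; the integer codes stay the same.
# strip_colons is an illustrative helper, not part of the C50 package.
strip_colons <- function(df, replacement = "_") {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], function(f) {
    levels(f) <- gsub(":", replacement, levels(f), fixed = TRUE)
    f
  })
  df
}

clean_set <- strip_colons(test_set)
levels(clean_set$region)
```

With the real data, clean_set would then be passed to C5.0() in place of test_set. Note that if two levels become identical after the replacement, assigning to levels() merges them into one level.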

mysteRious
  • Got it! Thanks so much for the quick solution! Hmm.. C5.0 seems to be very picky on feature formats;( – Mark Li Apr 13 '18 at 20:38