
Hi, I'm a beginner in R. I wrote some code for a regression tree using the rpart package. Some of the independent variables in my data have more than 100 levels. After running the rpart function I get the following warning message: "More than 52 levels in a predicting factor, truncated for printout", and my tree is displayed in a very weird way. For example, the tree splits on location, which has around 70 distinct levels, but the label displayed in the tree is "ZZZZZZZZZZZZZZZZ...........", even though I don't have any location called "ZZZZZZZZ".

Please help me.

Thanks in advance.

blahdiblah
R Learner

1 Answer


Many functions in R limit the number of levels a factor-type variable can have (e.g. randomForest limited a factor to 32 levels at the time of writing; the exact limit depends on the package version).

One way I've seen this dealt with, especially in data mining competitions, is to:

1) Determine the maximum number of levels allowed for a given function (call this X).

2) Use table() to count the occurrences of each level of the factor and rank the levels from most to least frequent.

3) Leave the top X - 1 levels of the factor as they are.

4) Change all the remaining levels (those ranked below the top X - 1) to a single level that identifies them as low-occurrence levels.

Here's an example that's a bit long but hopefully helps:

# Generate 1000 random numbers between 0 and 100.
vars1 <- data.frame(values1=(round(runif(1000) * 100,0)))
# Changes values to factor variable.
vars1$values1 <- factor(vars1$values1)
# Show top 6 rows of data frame.
head(vars1)
# Show the number of unique factor levels
length(unique(vars1$values1 ))
# Create table showing frequency of each level's occurrence.
# Tabulating the column directly gives the columns 'Var1' and 'Freq'.
table1 <- as.data.frame(table(vars1$values1))
# Orders the table in descending order of frequency.
table1 <- table1[order(-table1$Freq),]
head(table1)
# Assuming we want to use CART (rpart), we choose the top 51
# levels to leave unchanged
# Get values of top 51 occurring levels
noChange <- table1$Var1[1:51]
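# we use '-1000' as the catch-all level to avoid overlap w/ other levels (e.g. if '52' was
# actually one of the levels).
# ifelse() checks to see if the factor level is in the list of the top 51
# levels.  If present it uses it as is, if not it changes it to '-1000'.
# as.character() is needed so ifelse() returns the level labels rather than
# the factor's internal integer codes; factor() turns the result back into a factor.
vars1$newFactor <- factor(ifelse(vars1$values1 %in% noChange, as.character(vars1$values1), "-1000"))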
# Show the number of levels of the new factor column.
length(unique(vars1$newFactor))
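
As a rough sketch, the same steps could be wrapped in a small reusable helper. The function name collapse_rare_levels, its arguments, the "Other" label and the simulated location names are all made up for illustration; none of them come from rpart or base R.

```r
# Hypothetical helper (not part of rpart or base R): keep the most frequent
# levels of a factor and merge everything else into one catch-all level.
collapse_rare_levels <- function(x, max_levels, other = "Other") {
  x <- as.character(x)
  # Frequency of each level, most frequent first.
  counts <- sort(table(x), decreasing = TRUE)
  # Nothing to do if the factor is already within the limit.
  if (length(counts) <= max_levels) {
    return(factor(x))
  }
  # Guard against the catch-all label clashing with a real level.
  if (other %in% names(counts)) {
    stop("'other' label already exists as a level; choose a different one")
  }
  # Keep the top (max_levels - 1) levels; the catch-all level takes the last slot.
  keep <- names(counts)[seq_len(max_levels - 1)]
  # Missing values are left as NA rather than being moved into the catch-all level.
  factor(ifelse(is.na(x) | x %in% keep, x, other))
}

# Simulated data: 70 made-up location names, 1000 observations.
set.seed(42)
loc <- factor(sample(paste0("loc", 1:70), 1000, replace = TRUE))
loc52 <- collapse_rare_levels(loc, max_levels = 52)
nlevels(loc52)
```

Using a text label with an explicit clash check plays the same role as the '-1000' value above: the catch-all level must not coincide with a real level.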

Finally, you may want to consider using shortened (truncated) variable and level names in rpart, as the tree display gets very busy when there are many variables or when they have long names.
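
One possible way to shorten long level names before plotting is base R's abbreviate(). This is only a sketch: the location names and the minlength of 8 are made up for illustration.

```r
# Made-up location names, for illustration only.
loc <- factor(c("North Riverside District", "South Riverside District",
                "Central Business Area", "North Riverside District"))

# abbreviate() shortens each label to at least `minlength` characters while
# trying to keep the labels distinct; unname() drops the names it attaches.
short_levels <- unname(abbreviate(levels(loc), minlength = 8))

# Work on a copy so the original labels stay available.
loc_short <- loc
levels(loc_short) <- short_levels

# Keep a lookup table so the short labels can be mapped back later.
lookup <- data.frame(original = levels(loc), short = short_levels)
print(lookup)
```

The lookup table matters because abbreviated labels can be hard to read back once they appear on the tree.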

screechOwl
  • Thank you for your response. I prepared 52 levels for the independent variable but am still getting the same warning. I also tried with 32 levels, but the same thing happens. – R Learner Apr 13 '12 at 09:43
  • @RLearner: If you could post a piece of the data it would be easier to figure out what's going on. – screechOwl Apr 13 '12 at 20:21
  • Hi screechOwl, I reduced the levels to 26 & now it is working – R Learner Apr 16 '12 at 03:54
  • @RLearner: That's interesting. Glad it works. It's curious that 26 is half of 52 which is the stated maximum. I'd play around with the code to double check there's not a mistake somewhere. Those types of relationships usually aren't an accident. Cheers. – screechOwl Apr 16 '12 at 04:28
  • Thank you screechOwl for your help. – R Learner Apr 19 '12 at 10:53