Many functions in R limit the number of levels a factor variable can have (e.g. randomForest
limits a factor to 32 levels).
One way that I've seen it dealt with, especially in data mining competitions, is to:
1) Determine the maximum number of levels allowed for a given function (call this X).
2) Use table() to determine the number of occurrences of each level of the factor and rank them from greatest to least.
3) For the top X - 1 levels of the factor, leave them as is.
4) For the remaining levels (those ranked X or lower by frequency), change them all to a single level that identifies them as low-occurrence levels.
Here's an example that's a bit long but hopefully helps:
# Generate 1000 random numbers between 0 and 100.
vars1 <- data.frame(values1 = round(runif(1000) * 100, 0))
# Changes values to factor variable.
vars1$values1 <- factor(vars1$values1)
# Show top 6 rows of data frame.
head(vars1)
# Show the number of unique factor levels.
length(unique(vars1$values1))
# Create a table showing the frequency of each level's occurrence.
# Naming the argument makes the level column 'values1' in any R version.
table1 <- as.data.frame(table(values1 = vars1$values1))
# Order the table in descending order of frequency.
table1 <- table1[order(-table1$Freq), ]
head(table1)
# Assuming we want to use CART, we choose the top 51
# levels to leave unchanged.
# Get the values of the 51 most frequently occurring levels.
noChange <- table1$values1[1:51]
# We use '-1000' as the catch-all level to avoid overlap w/ other levels (i.e. if '52' was
# actually one of the levels).
# ifelse() checks whether each value is in the list of the top 51
# levels. If present it is kept as is; if not, it is changed to '-1000'.
# as.character() is needed because ifelse() would otherwise return the
# factor's integer codes instead of its labels.
vars1$newFactor <- factor(ifelse(vars1$values1 %in% noChange, as.character(vars1$values1), "-1000"))
# Show the number of levels of the new factor column.
length(unique(vars1$newFactor))
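If you need to do this for several columns, the steps above can be wrapped into a small function. This is only a rough sketch: lump_rare_levels() and its arguments are names I made up for illustration (they are not from any package), and it assumes the catch-all label does not already occur in the data.

```r
# Hypothetical helper: keep the (max_levels - 1) most frequent levels of a
# factor and collapse every other level into a single catch-all level.
lump_rare_levels <- function(f, max_levels, other = "-1000") {
  # Work on character labels so ifelse() does not return integer codes.
  f <- as.character(f)
  stopifnot(max_levels >= 2, !(other %in% f))
  # Frequency of each level, most frequent first.
  counts <- sort(table(f), decreasing = TRUE)
  # Labels of the levels to leave unchanged.
  keep <- names(counts)[seq_len(min(max_levels - 1, length(counts)))]
  # Anything not in 'keep' (including NA) becomes the catch-all level.
  factor(ifelse(f %in% keep, f, other))
}

set.seed(42)  # arbitrary seed, only so the example is reproducible
x <- factor(round(runif(1000) * 100, 0))
y <- lump_rare_levels(x, max_levels = 32)
nlevels(y)  # at most 32
```

Ties in frequency are broken by whatever order sort() returns, so which of two equally rare levels gets kept is arbitrary here.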
Finally, you may want to consider using truncated variable names in rpart, as the tree display gets very busy when there are a large number of variables or they have long names.
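One way to shorten names before fitting is base R's abbreviate(). The sketch below uses a made-up data frame (the column names and labels are hypothetical) and shortens both the column names and the factor labels; the minlength values are arbitrary choices.

```r
# Hypothetical data frame with long column names and long factor labels.
df <- data.frame(
  customer_account_status = factor(c("currently_active",
                                     "permanently_closed",
                                     "currently_active")),
  total_purchase_amount   = c(10.5, 3.2, 8.8)
)

# abbreviate() is base R; minlength is the target length of each
# abbreviation (results can be longer if needed to stay unique).
names(df) <- unname(abbreviate(names(df), minlength = 8))
levels(df[[1]]) <- unname(abbreviate(levels(df[[1]]), minlength = 6))

names(df)
levels(df[[1]])
```

Keep a lookup of the original names somewhere if you need to map the tree's labels back afterwards.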