I'm trying to work around the randomForest package limit of 32 levels for factors.
I have a data set with 100 levels in one of the factor variables.
I wrote the following code to see what repeated sampling with replacement would look like, and how many draws it would take to cover a given percentage of the levels.
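# sampAll accumulates every level index drawn so far
sampAll <- c()
nums1 <- seq(1, 102, 1)
for (i in 1:20) {
  # each draw picks 32 distinct levels; levels can recur across draws
  samp1 <- sample(nums1, 32)
  sampAll <- union(sampAll, samp1)
  # levels that have not been drawn yet
  outSamp1 <- setdiff(nums1, sampAll)
  print(paste(i, " | Remaining: ", length(outSamp1) / 102, sep = ""))
  flush.console()
}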
[1] "1 | Remaining: 0.686274509803922"
[1] "2 | Remaining: 0.490196078431373"
[1] "3 | Remaining: 0.333333333333333"
[1] "4 | Remaining: 0.254901960784314"
[1] "5 | Remaining: 0.215686274509804"
[1] "6 | Remaining: 0.147058823529412"
[1] "7 | Remaining: 0.117647058823529"
[1] "8 | Remaining: 0.0980392156862745"
[1] "9 | Remaining: 0.0784313725490196"
[1] "10 | Remaining: 0.0784313725490196"
[1] "11 | Remaining: 0.0490196078431373"
[1] "12 | Remaining: 0.0294117647058824"
[1] "13 | Remaining: 0.0196078431372549"
[1] "14 | Remaining: 0.00980392156862745"
[1] "15 | Remaining: 0.00980392156862745"
[1] "16 | Remaining: 0.00980392156862745"
[1] "17 | Remaining: 0.00980392156862745"
[1] "18 | Remaining: 0"
[1] "19 | Remaining: 0"
[1] "20 | Remaining: 0"
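As a rough cross-check on the simulated output above, the expected fraction of levels still unseen after k draws can be computed directly. This is a minimal sketch, assuming each draw picks 32 of the 102 levels uniformly and independently of earlier draws (as `sample(nums1, 32)` does), so a given level is missed by a single draw with probability 70/102. The names `n_levels`, `per_draw`, `expected_remaining`, and `draws_for_95` are illustrative and not part of the original code.

```r
n_levels <- 102   # number of level indices, as in nums1 above
per_draw <- 32    # levels picked per draw
k <- 1:20         # number of draws

# A level is unseen after k draws only if every draw missed it
expected_remaining <- ((n_levels - per_draw) / n_levels)^k

# Smallest number of draws whose expected unseen fraction is below 5%
draws_for_95 <- min(k[expected_remaining < 0.05])

print(round(expected_remaining, 3))
print(draws_for_95)
```

This gives only the expectation; any single run, like the one printed above, will scatter around it.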
What I'm debating is whether to sample with or without replacement.
I'm thinking about:
- getting a sample of 32 of the 100 factor levels,
- using the rows with those levels to fit the randomForest,
- predicting the test set with that randomForest, and
- repeating this process either (a) 3 times (WITHOUT replacement) or (b) 10-15 times (WITH replacement),
- taking the 3 or 10-15 predicted values, finding the average, and using that as the final prediction.
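The steps above can be sketched roughly as follows for variant (b). This is only an illustration under several assumptions that are not in the original post: the data are simulated, the column names `f`, `x`, and `y` and the variables `lv`, `n_rounds`, `preds`, and `final` are hypothetical, the response is numeric (regression), and a test row whose level was not in a round's subset simply gets no prediction from that round.

```r
library(randomForest)  # assumed to be installed

set.seed(1)
# Hypothetical toy data: factor f with 100 levels, numeric x, numeric response y
n  <- 2000
lv <- paste0("L", 1:100)
dat <- data.frame(
  f = factor(sample(lv, n, replace = TRUE), levels = lv),
  x = rnorm(n)
)
dat$y <- as.integer(dat$f) / 10 + dat$x + rnorm(n)
train <- dat[1:1500, ]
test  <- dat[1501:n, ]

n_rounds <- 10
# One column of predictions per round; NA where a round cannot predict a row
preds <- matrix(NA_real_, nrow = nrow(test), ncol = n_rounds)

for (r in seq_len(n_rounds)) {
  keep <- sample(lv, 32)                    # 32 levels for this round
  tr <- train[train$f %in% keep, ]
  tr$f <- factor(tr$f, levels = keep)       # reduce the factor to 32 levels
  fit <- randomForest(y ~ f + x, data = tr, ntree = 100)

  ok <- test$f %in% keep                    # test rows this forest can score
  te <- test[ok, ]
  te$f <- factor(te$f, levels = keep)       # same levels as in training
  preds[ok, r] <- predict(fit, newdata = te)
}

# Average over the rounds that covered each row; rows whose level was
# never sampled in any round come out as NaN
final <- rowMeans(preds, na.rm = TRUE)
```

In this sketch each test row is averaged over a different number of forests, and rows whose level was never drawn get no prediction at all, which is worth keeping in mind when weighing the two variants. For variant (a), disjoint groups could be formed with something like `split(sample(lv), ceiling(seq_along(lv) / 32))`; with 100 levels that yields four groups (32, 32, 32, and 4), since three groups of 32 cover only 96 levels.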
I'm curious whether anyone has tried something like this, whether I'm breaking any rules (introducing bias, etc.), and whether anyone has any suggestions.
NOTE: I've cross-posted this question on Cross Validated (stats.stackexchange.com) as well.