For classification I used Weka's J48 decision tree to build a model on several nominal attributes. I now have more data to classify (5 nominal attributes), but each attribute has 3000 distinct values. I ran J48 with pruning, but it ran out of memory (4 GB allocated). With a smaller dataset, I saw in the output that J48 keeps all leaves that have no instances associated with them. Why are they kept in the model? Should I switch to another classification algorithm?
-
You need to do some feature processing; it is unwise to feed a categorical feature with 3000 values into a decision tree model. – TYZ Apr 02 '18 at 15:07
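One common form of the feature processing suggested here is to merge infrequent values into a single catch-all category before training. The following is only a minimal sketch in plain Python, not Weka code: the function name `collapse_rare_values`, the threshold of 20, the `OTHER` label, and the example column are all hypothetical choices made for illustration.

```python
from collections import Counter


def collapse_rare_values(values, min_count=20, other_label="OTHER"):
    """Replace every value seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical column: two frequent values plus 100 values that occur once.
column = ["red"] * 50 + ["blue"] * 30 + [f"rare_{i}" for i in range(100)]
collapsed = collapse_rare_values(column, min_count=20)

print(len(set(column)), "distinct values before")
print(len(set(collapsed)), "distinct values after")
```

Whether a frequency threshold is appropriate depends on the data; grouping values by domain knowledge may work better if the rare values carry signal.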
-
You could set J48's minNumObj hyperparameter to a higher value, say 20, or try the rules/PART algorithm, which is (according to the context menu documentation) a simplified version of C4.5 / J48 and may need less memory. – knb Apr 03 '18 at 07:36
-
"keeps all leaves with no instances" -- is this in the test set? There may be no instances with those values in the test set. – zbicyclist Apr 04 '18 at 02:06
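On why zero-instance leaves can appear at all: as far as I understand C4.5-style trees, a non-binary split on a nominal attribute creates one branch per value declared for that attribute, not one per value actually observed at that node. The sketch below imitates this in plain Python under that assumption. It is not J48's implementation, and `multiway_split`, the attribute name `a`, and the toy data are hypothetical.

```python
def multiway_split(rows, attribute, declared_values):
    """Create one branch per declared value, even if no row carries it."""
    branches = {value: [] for value in declared_values}
    for row in rows:
        branches[row[attribute]].append(row)
    return branches


# Hypothetical attribute with 3000 declared values, of which only 10 occur.
declared = [f"v{i}" for i in range(3000)]
rows = [{"a": f"v{i % 10}", "label": "yes" if i % 2 else "no"} for i in range(200)]

branches = multiway_split(rows, "a", declared)
empty = sum(1 for members in branches.values() if not members)

print(len(branches), "branches,", empty, "without training instances")
```

Under this assumption, a single split on a 3000-value attribute already yields thousands of children, and nesting such splits multiplies the count, which would be consistent with the memory problem described in the question.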