I am building a decision tree on a data set with n=3410 observations. The target variable has 6 unique values, and each observation takes exactly one of them. The distribution of the target values in the data set used to build the model is:

1 - 242
2 - 917
3 - 645
4 - 488
5 - 261
6 - 841

However, when I build the model from this data, values 1 and 5 have a 100% error rate. The root node error is also very high: 73%.

I'm trying to understand what can cause this. I can see that these 2 values occur less often than the others in the data set, but they are not rare enough to be negligible. I can't explain the root node error at all, though.

I've tried tuning the tree and manipulating the data set itself, but I still consistently get an overall error of about 60% in the confusion matrix. I don't really understand what this means, how I can improve it, or whether it is just a limitation of the data I'm using.
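
For reference, here is a minimal sketch of the baseline I would compare against: the error of always predicting the most frequent value, computed from the counts listed above. I am assuming, without being sure, that this is what the root node error reports. The variable names are my own and are not taken from any tree library.

```python
# Class counts exactly as listed in the question (they sum to 3394).
counts = {1: 242, 2: 917, 3: 645, 4: 488, 5: 261, 6: 841}

total = sum(counts.values())

# The most frequent target value is what a tree with no splits would predict.
majority_class = max(counts, key=counts.get)

# Error rate of always predicting that value (assumed meaning of "root node error").
baseline_error = 1 - counts[majority_class] / total

print(f"total={total}, majority class={majority_class}, "
      f"baseline error={baseline_error:.1%}")
```

If that assumption holds, an overall error of about 60% would be an improvement over this baseline rather than a worse result.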

L KC