
I’m fairly new to decision trees. When I run a Chi-square test between a binary categorical variable and family size, I get the following p-value, and then the pairwise p-values from a post-hoc analysis using the Bonferroni control method in the “fifer” package in R (which uses the Fisher test):

X-squared = 29.546, df = 4, p-value = 6.055e-06

   comparison  raw.p    adj.p
1     0 vs. 1   0.0000 0.0000
2     0 vs. 2   0.2254 0.3220
3     0 vs. 3   0.5956 0.6618
4     0 vs. 4   0.1354 0.2707
5     1 vs. 2   0.5475 0.6618
6     1 vs. 3   0.0367 0.1223
7     1 vs. 4   0.0028 0.0140
8     2 vs. 3   0.2076 0.3220
9     2 vs. 4   0.0579 0.1448
10    3 vs. 4   0.6815 0.6815
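
For reference, the post-hoc step is equivalent in spirit to the base-R sketch below (the fifer helper wraps pairwise Fisher tests plus a p-value adjustment; df, outcome, and family_size are placeholder names, and the plain Bonferroni adjustment here may not reproduce my adj.p column exactly):

# Contingency table with family-size levels (0-4) in rows,
# binary outcome in columns; names are placeholders
tbl <- table(df$family_size, df$outcome)

# Overall Chi-square test
chisq.test(tbl)

# All pairwise Fisher exact tests between family-size levels,
# with Bonferroni-adjusted p-values
pairs <- combn(rownames(tbl), 2, simplify = FALSE)
raw.p <- sapply(pairs, function(p) fisher.test(tbl[p, ])$p.value)
data.frame(comparison = sapply(pairs, paste, collapse = " vs. "),
           raw.p      = round(raw.p, 4),
           adj.p      = round(p.adjust(raw.p, method = "bonferroni"), 4))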

However, when I create a decision tree from the same data using method = "class" and the Gini split criterion with cp = 0.01 in the rpart package of R, the tree splits at a family size of 1 (as I would expect based on the table above), then at 3 (not what I would expect based on the table above), and subsequently at 2. I had expected the tree splits to align with the Chi-square table, meaning it would split in order of significance, i.e. split at 1 and then at 4. Is this an incorrect line of thinking? If so, why? I am under the impression that both methods use the same test to determine significance and that the decision tree would split accordingly, but that seems to be wrong.
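
The tree is grown with a call along these lines (again with placeholder data-frame and variable names; cp = 0.01 is also rpart's default):

library(rpart)

# Classification tree of the binary outcome on family size
fit <- rpart(outcome ~ family_size,
             data   = df,
             method = "class",
             parms  = list(split = "gini"),
             cp     = 0.01)

# Inspect the chosen splits
print(fit)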

I've researched this both here on Stack Overflow and elsewhere online. I came across an article that seems to corroborate my thinking, but I'm still unsure why I'm getting different results.

