I am learning the decision tree method for machine learning. Right now, the most important piece of code I use is C5.0. I have to admit, it is a work of genius, but I couldn't understand how it chooses the root and decision nodes.
Example: I have a data frame named 'credit'. Here are the first few columns:
str(credit)
'data.frame': 1000 obs. of 21 variables:
$ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
$ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
$ credit_history : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
$ purpose : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
$ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
So when I look at the decision tree after applying C5.0, I see that the root node is $checking_balance, and the next decision node is $credit_history.
What strategy does C5.0 follow when building a decision tree? In other words, how does it determine the order of decision nodes?
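To make the question concrete: tree learners in the C4.5/C5.0 family are usually described as ranking candidate splits with an entropy-based score such as information gain, and putting the best-scoring feature at the current node. Below is a minimal base-R sketch of that idea. It is only an illustration under assumptions: the `toy` data frame is invented (it is not my real 'credit' data), the `default` target column and the helper names `entropy` and `info_gain` are hypothetical, and it does not call C5.0 itself or claim to reproduce its exact internals.

```r
# Toy data invented for illustration; NOT the real 'credit' data.
# The 'default' target column is a hypothetical name.
toy <- data.frame(
  checking_balance = factor(c("< 0 DM", "< 0 DM", "< 0 DM", "> 200 DM",
                              "> 200 DM", "unknown", "unknown", "unknown")),
  credit_history   = factor(c("critical", "repaid", "critical", "repaid",
                              "critical", "repaid", "critical", "repaid")),
  default          = factor(c("yes", "yes", "yes", "no",
                              "no", "no", "no", "yes"))
)

# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]              # drop empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain from splitting the labels y by the levels of x:
# entropy before the split minus the weighted entropy of the branches.
info_gain <- function(x, y) {
  x <- droplevels(factor(x))
  w <- as.numeric(table(x)) / length(x)    # share of rows in each branch
  h <- as.numeric(tapply(y, x, entropy))   # entropy inside each branch
  entropy(y) - sum(w * h)
}

# Score every candidate predictor against the target and rank them.
predictors <- c("checking_balance", "credit_history")
gains <- sapply(predictors, function(v) info_gain(toy[[v]], toy$default))
print(sort(gains, decreasing = TRUE))
```

On this toy data `checking_balance` scores higher than `credit_history`, so a learner that greedily picks the best score would put it at the root and then repeat the same ranking inside each branch. Is this, roughly, what C5.0 does to decide the order of nodes?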