To split a node into two child nodes, one common method is to split on the variable that maximises the information gain.
When you reach a pure leaf node, the information gain equals 0 (you can't gain any information by splitting a node that contains only one class - logic).
In your example, Entropy(S) = 1.571 is your current entropy - the one you have before splitting. Let's call it HBase.
Then you compute the entropy of the child nodes for each candidate split.
To get your Information Gain, you subtract the weighted entropy of your child nodes from HBase:
-> gain = HBase - child1NumRows/numOfRows*entropyChild1 - child2NumRows/numOfRows*entropyChild2
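As a quick numeric illustration (the numbers here are made up, not taken from your data): suppose a split divides 10 rows into children of 6 and 4 rows, with child entropies 0.65 and 0.81:

```python
h_base = 1.571  # entropy before the split
# weight each child's entropy by its share of the rows
gain = h_base - (6 / 10) * 0.65 - (4 / 10) * 0.81
print(round(gain, 3))  # 0.857
```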
import math

def GetEntropy(dataSet):
    results = ResultsCounts(dataSet)
    h = 0.0  # h => entropy
    for i in results.keys():
        p = float(results[i]) / NbRows(dataSet)
        h = h - p * math.log2(p)
    return h
def GetInformationGain(dataSet, currentH, child1, child2):
    p = float(NbRows(child1)) / NbRows(dataSet)
    gain = currentH - p * GetEntropy(child1) - (1 - p) * GetEntropy(child2)
    return gain
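Putting it together as a runnable sketch - note that ResultsCounts and NbRows are not defined in your snippet, so the versions below are my assumptions (they assume each row's last element is the class label):

```python
import math

def ResultsCounts(dataSet):
    # Assumed helper: count occurrences of each class label
    # (assumes the label is the last element of each row).
    counts = {}
    for row in dataSet:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return counts

def NbRows(dataSet):
    # Assumed helper: number of rows in the dataset.
    return len(dataSet)

def GetEntropy(dataSet):
    results = ResultsCounts(dataSet)
    h = 0.0  # h => entropy
    for i in results.keys():
        p = float(results[i]) / NbRows(dataSet)
        h = h - p * math.log2(p)
    return h

def GetInformationGain(dataSet, currentH, child1, child2):
    p = float(NbRows(child1)) / NbRows(dataSet)
    return currentH - p * GetEntropy(child1) - (1 - p) * GetEntropy(child2)

# Toy dataset: (feature, label) rows, two classes split evenly.
data = [(0, 'A'), (0, 'A'), (1, 'B'), (1, 'B')]
h = GetEntropy(data)  # two equally likely classes -> 1.0

# Splitting on the feature yields two pure children, so the gain
# equals the full parent entropy.
left = [r for r in data if r[0] == 0]
right = [r for r in data if r[0] == 1]
gain = GetInformationGain(data, h, left, right)
print(h, gain)  # 1.0 1.0
```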
The objective is to pick the split with the best of all Information Gains!
Hope this helps! :)