Using the "play golf" (or "play ball") data listed at the bottom, we pick the root node by comparing Outlook, Temperature, Humidity, and Wind to see which has the highest GainRatio.

Outlook ends up being chosen because it has the highest GainRatio. What confuses me is Humidity, a continuous attribute: the split point 80 is selected, with GainRatio = 0.1087, even though 65 has a higher GainRatio = 0.1285. The split point 80 does have the higher Gain, but not the higher GainRatio.

The literature I have seen says roughly "pick the split point for a continuous attribute to be the one giving the most gain". It seems counterintuitive to me that the split point is chosen on Gain alone, whereas the next decision node is chosen by comparing the GainRatio of all the attributes.

I hope to gain some clarity here.

Thanks.

The calculations are as follows:

OUTLOOK:
Gain = 0.2467
SplitInfo = 1.5774
Gain Ratio = 0.1564

TEMPERATURE:
Gain = 0.0292
SplitInfo = 1.5566
Gain Ratio = 0.0187

HUMIDITY:
Possible split points = { 65, 70, 75, 78, 80, 85, 90, 95, 96 }

Split 65:
Gain = 0.0477
SplitInfo = 0.3712
Gain Ratio = 0.1285

Split 80:
Gain = 0.1022
SplitInfo = 0.9403
Gain Ratio = 0.1087

WIND:
Gain = 0.0481
SplitInfo = 0.9852
Gain Ratio = 0.0488
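
The figures for the three categorical attributes can be checked with a short script. This is only a minimal sketch: the names `ROWS`, `entropy`, and `gain_ratio` are made up for illustration, the rows are copied by hand from the DATA table, and SplitInfo is assumed to be the entropy of the attribute's own value distribution.

```python
from collections import Counter
from math import log2

# Rows copied from the DATA table: (outlook, temperature, wind, play).
ROWS = [
    ("sun", "hot", "low", "no"),
    ("sun", "hot", "high", "no"),
    ("overcast", "hot", "low", "yes"),
    ("rain", "sweet", "low", "yes"),
    ("rain", "cold", "low", "yes"),
    ("rain", "cold", "high", "no"),
    ("overcast", "cold", "high", "yes"),
    ("sun", "sweet", "low", "no"),
    ("sun", "cold", "low", "yes"),
    ("rain", "sweet", "low", "yes"),
    ("sun", "sweet", "high", "yes"),
    ("overcast", "sweet", "high", "yes"),
    ("overcast", "hot", "low", "yes"),
    ("rain", "sweet", "high", "no"),
]


def entropy(items):
    """Shannon entropy (base 2) of the value distribution in `items`."""
    n = len(items)
    return -sum(c / n * log2(c / n) for c in Counter(items).values())


def gain_ratio(column):
    """Return (gain, split_info, gain_ratio) for a categorical column index."""
    values = [row[column] for row in ROWS]
    labels = [row[-1] for row in ROWS]
    # Weighted entropy of the class labels inside each branch.
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the branch sizes
    return gain, split_info, gain / split_info


for name, column in [("Outlook", 0), ("Temperature", 1), ("Wind", 2)]:
    gain, split_info, ratio = gain_ratio(column)
    print(f"{name}: Gain={gain:.4f} SplitInfo={split_info:.4f} GainRatio={ratio:.4f}")
```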

DATA:

Outlook  Temperature  Humidity  Wind    Play
--------------------------------------------
sun        hot          85      low     no
sun        hot          90      high    no
overcast   hot          78      low     yes
rain       sweet        96      low     yes
rain       cold         80      low     yes
rain       cold         70      high    no
overcast   cold         65      high    yes
sun        sweet        95      low     no
sun        cold         70      low     yes
rain       sweet        80      low     yes
sun        sweet        70      high    yes
overcast   sweet        90      high    yes
overcast   hot          75      low     yes
rain       sweet        80      high    no
Rahul
GreekFire
  • That is an interesting question. Perhaps it could also be asked on http://stats.stackexchange.com/ – Cesar Mar 08 '15 at 18:48

1 Answer

Information gain ratio is used to reduce the bias towards attributes with a large number of values: it takes the number and size of the branches into account when choosing an attribute. When picking a split point, the attribute has already been chosen and every candidate threshold produces a two-way split, so there is no such bias to correct. Thus, we should maximize information gain instead of information gain ratio at this step.
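
As a rough illustration of that two-stage idea, the sketch below picks the Humidity threshold by gain alone and only then computes the gain ratio of that one split, which is the number compared against the other attributes. It is an assumption-laden sketch rather than a reference implementation: `HUMIDITY`, `PLAY`, `entropy`, and `evaluate` are illustrative names, the values are copied from the question's table, the split is taken as `humidity <= threshold`, and the largest value is dropped from the candidates because it would leave one branch empty.

```python
from collections import Counter
from math import log2

# Humidity and Play columns copied from the question's table, row by row.
HUMIDITY = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
PLAY = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]


def entropy(items):
    """Shannon entropy (base 2) of the value distribution in `items`."""
    n = len(items)
    return -sum(c / n * log2(c / n) for c in Counter(items).values())


def evaluate(threshold):
    """Return (gain, split_info) for the binary split humidity <= threshold."""
    left = [y for h, y in zip(HUMIDITY, PLAY) if h <= threshold]
    right = [y for h, y in zip(HUMIDITY, PLAY) if h > threshold]
    n = len(PLAY)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(PLAY) - remainder
    split_info = entropy([h <= threshold for h in HUMIDITY])
    return gain, split_info


# Drop the largest value: it would leave the "> threshold" branch empty.
candidates = sorted(set(HUMIDITY))[:-1]

# Stage 1: choose the threshold by information gain only.
best = max(candidates, key=lambda t: evaluate(t)[0])

# Stage 2: compute the gain ratio of that split for comparing attributes.
gain, split_info = evaluate(best)
print(f"threshold={best} Gain={gain:.4f} GainRatio={gain / split_info:.4f}")
```

On this data the first stage lands on 80, matching the split point in the question, while ranking thresholds by gain ratio would instead favour a lopsided split such as 65, which sends a single row down one branch.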