Threshold value Z:
– The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v1, v2, …, vm}.
– Any threshold value lying between vi and vi+1 will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v1, v2, …, vi} and those whose value is in {vi+1, vi+2, …, vm}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.
– It is usual to choose the midpoint of each interval, (vi + vi+1)/2, as the representative threshold.
– C4.5 instead chooses the smaller value vi of every interval {vi, vi+1} as the threshold, rather than the midpoint itself.
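
A minimal sketch of the two conventions described in the notes, in Python. The function name `candidate_thresholds` and the `use_midpoint` flag are illustrative names of my own, not part of C4.5 itself.

```python
def candidate_thresholds(values, use_midpoint=True):
    """Return the m-1 candidate thresholds for a numeric attribute."""
    distinct = sorted(set(values))        # v1 < v2 < ... < vm
    pairs = zip(distinct, distinct[1:])   # adjacent pairs (vi, vi+1)
    if use_midpoint:
        # Usual convention: midpoint of each interval.
        return [(lo + hi) / 2 for lo, hi in pairs]
    # Convention attributed to C4.5 in the notes: the smaller value vi.
    return [lo for lo, _ in pairs]


values = [65, 70, 75, 78, 80, 85, 90, 95, 96]
midpoints = candidate_thresholds(values)
lower_values = candidate_thresholds(values, use_midpoint=False)
```

Either way there are m-1 candidates for m distinct values, and both conventions produce the same partition of the training cases; only the stored threshold value differs.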
I just want to know if I got this right.
Let's say I have:
{65, 70, 75, 78, 80, 85, 90, 95, 96}.
I must do m-1 calculations to find the optimal value, so the candidate thresholds are:
{65, 70, 75, 78, 80, 85, 90, 95}.
For each split (e.g. < 65 and >= 65, < 70 and >= 70, and so on), I must calculate
the gain ratio and choose the split that gives me the highest gain ratio. Am I right?
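
To make the question concrete, here is a rough Python sketch of the procedure as I understand it. It is not C4.5's actual implementation. I am assuming, following the notes, that a case goes to the left branch when its value is <= the threshold and to the right branch otherwise. The class labels are made up purely for illustration, and the helper names (`entropy`, `gain_ratio`, `best_threshold`) are my own.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # degenerate split: nothing is separated
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


def best_threshold(values, labels):
    """Try the m-1 lower values vi and return the one with the highest gain ratio."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


values = [65, 70, 75, 78, 80, 85, 90, 95, 96]
# Hypothetical labels, invented only so the sketch can run.
labels = ["no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes"]
best = best_threshold(values, labels)
```

With these invented labels the classes are perfectly separated between 78 and 80, so `best` comes out as 78.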