
Threshold value Z:

– The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v_1, v_2, …, v_m}.
– Any threshold value lying between v_i and v_{i+1} will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v_1, v_2, …, v_i} and those whose value is in {v_{i+1}, v_{i+2}, …, v_m}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.
– It is usual to choose the midpoint of each interval, (v_i + v_{i+1})/2, as the representative threshold.
– C4.5 chooses as the threshold the smaller value v_i for every interval {v_i, v_{i+1}}, rather than the midpoint itself.
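To make the two conventions concrete, here is a minimal sketch that lists the m-1 candidate thresholds for a sorted attribute, either as midpoints or as the lower endpoint of each interval as the quoted text describes for C4.5. The function name `candidate_thresholds` and the `use_midpoint` flag are made up for illustration and do not come from any library.

```python
def candidate_thresholds(values, use_midpoint=False):
    """Return the m-1 candidate thresholds for a numeric attribute.

    Hypothetical helper: duplicates are collapsed first, so m is the
    number of distinct values.
    """
    distinct = sorted(set(values))
    # Pair each value v_i with its successor v_{i+1}.
    pairs = list(zip(distinct, distinct[1:]))
    if use_midpoint:
        # Common convention: midpoint of each interval.
        return [(lo + hi) / 2 for lo, hi in pairs]
    # Convention described in the quoted text for C4.5: the lower value v_i.
    return [lo for lo, _ in pairs]


values = [65, 70, 75, 78, 80, 85, 90, 95, 96]
print(candidate_thresholds(values))
print(candidate_thresholds(values, use_midpoint=True))
```

Either way there are eight candidates for nine distinct values; only the representative number reported for each split differs, not the partition it induces.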

I just want to know if I've got this right.

Let's say I have:

{65, 70, 75, 78, 80, 85, 90, 95, 96}. 

I must do m-1 calculations to find the optimal value, so the candidate thresholds are:

{65, 70, 75, 78, 80, 85, 90, 95}.     

For each split (e.g. < 65 and >= 65, < 70 and >= 70, and so on), I must calculate
the gain ratio and choose the split with the highest value. Am I right?
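The procedure I am describing could be sketched roughly as below. This is only an illustration under a few assumptions: each split is scored with gain ratio as stated above, the partition follows the quoted definition (values up to and including v_i go left, the rest go right), and the class labels are invented because my data above only lists attribute values. The names `entropy`, `gain_ratio` and `best_threshold` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split: value <= threshold vs. value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # degenerate split, nothing is separated
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    info_after = weights[0] * entropy(left) + weights[1] * entropy(right)
    gain = entropy(labels) - info_after
    split_info = -sum(w * log2(w) for w in weights)
    return gain / split_info


def best_threshold(values, labels):
    """Try the m-1 lower-endpoint candidates and keep the best-scoring one."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


values = [65, 70, 75, 78, 80, 85, 90, 95, 96]
# Invented labels, purely for illustration.
labels = ["no"] * 4 + ["yes"] * 5
print(best_threshold(values, labels))
```

The largest value (96 here) is left out of the candidates because a split at the maximum would put every case on one side.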

Lee Taylor
Nick
