Numeric Values in C4.5 algorithm

Asked Nov 26 '15 at 17:00

Active Nov 26 '15 at 17:46


Threshold value Z:

–The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v_1, v_2, …, v_m}.

–Any threshold value lying between v_i and v_{i+1} will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v_1, v_2, …, v_i} and those whose value is in {v_{i+1}, v_{i+2}, …, v_m}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.

–It is usual to choose the midpoint of each interval, (v_i + v_{i+1})/2, as the representative threshold.

–C4.5 chooses as the threshold the smaller value v_i for every interval {v_i, v_{i+1}}, rather than the midpoint itself.
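To make the quoted rule concrete, here is a minimal sketch of how the m-1 candidate thresholds could be enumerated. The function name `candidate_thresholds` and the `use_midpoints` flag are my own illustrative choices, not part of any C4.5 implementation; the sketch also assumes duplicate values are collapsed before pairing neighbours.

```python
def candidate_thresholds(values, use_midpoints=False):
    """Return the m-1 candidate thresholds for a numeric attribute.

    Hypothetical helper for illustration only.
    """
    # Sort the distinct observed values: v_1 < v_2 < ... < v_m
    distinct = sorted(set(values))
    # Pair each value with its successor: (v_i, v_{i+1})
    pairs = list(zip(distinct, distinct[1:]))
    if use_midpoints:
        # The "usual" choice: the midpoint of each interval
        return [(lo + hi) / 2 for lo, hi in pairs]
    # The choice the quoted text attributes to C4.5: the smaller value v_i
    return [lo for lo, _ in pairs]


values = [65, 70, 75, 78, 80, 85, 90, 95, 96]
print(candidate_thresholds(values))
print(candidate_thresholds(values, use_midpoints=True))
```

Both variants produce the same partitions of the training samples; they differ only in which number is reported as the threshold.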

I just want to know if I have understood this correctly.

Let's say I have:

{65, 70, 75, 78, 80, 85, 90, 95, 96}.

With m = 9 values, I must do m-1 = 8 calculations to find the optimal threshold, so the candidates are:

{65, 70, 75, 78, 80, 85, 90, 95}.

For each split (e.g. <= 65 and > 65, <= 70 and > 70, and so on) I must calculate
the gain ratio and choose the split that gives the highest value. Am I right?
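As a sketch of what I mean, the procedure might look like the code below. The class labels are made up purely for illustration (the question has no labels), the function names are hypothetical, and gain ratio is taken here as information gain divided by split information; a real C4.5 implementation may apply further adjustments that this sketch ignores.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split: value <= threshold vs. value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # degenerate split, nothing is separated
    n = len(labels)
    gain = (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
    # Split information: entropy of the partition sizes themselves
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


def best_threshold(values, labels):
    """Try every candidate v_i (all distinct values except the largest)."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


values = [65, 70, 75, 78, 80, 85, 90, 95, 96]
# Hypothetical labels, invented only so the example can run
labels = ["no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes"]

for t in sorted(set(values))[:-1]:
    print(t, round(gain_ratio(values, labels, t), 4))
print("best threshold:", best_threshold(values, labels))
```

With these made-up labels the split at 78 separates the two classes perfectly, so it is the one selected.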

edited Nov 26 '15 at 17:46

Lee Taylor


asked Nov 26 '15 at 17:00

Nick

