Suppose I have a data set:

    X     y
   20     0
   22     0
   24     1
   27     0
   30     1
   40     1
   20     0
   ...

I am trying to discretize X into a few bins by minimizing the entropy, so I did the following:

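    import numpy as np
    from sklearn import tree

    # Fit a shallow tree on the single feature X
    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
    clf.fit(X.values.reshape(-1, 1), y.values)

    # Keep only real split thresholds (leaf nodes are stored as -2), then sort them
    threshold = clf.tree_.threshold[clf.tree_.threshold > -2]
    threshold = np.sort(threshold)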

`threshold` should give the splitting points. Is this a correct way of binning data?

Any suggestions?

user6396
  • This might be a silly question, but why are there so many -2 thresholds and why just exclude them? I might be missing an obvious google search that would reveal this (so apologies for the ignorance), but have not found anything so far. – pwjvr Nov 09 '18 at 06:41
  • @pwjvr - did you find out why there is so much of `-2`? I also have the same problem – The Great Jan 26 '22 at 09:25

1 Answer

First, what you did is correct.

There are many ways to bin your data:

  1. Based on the values of the column, e.g. dividing the range between the column's min and max into 10 equal-width groups.
  2. Based on the distribution of the column values, e.g. 10 groups defined by the deciles of the column (pandas.qcut is the better tool for that).
  3. Based on the target, as you did. I found this blog post relevant, and I think your method for finding the best splits works just fine: https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b
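
To make the three options concrete, here is a minimal sketch on the seven rows shown in the question (the elided rows are not reproduced, and the choice of 4 bins and `max_depth=2` is only an assumption for this toy data). It also touches on the `-2` values raised in the comments: as far as I know, scikit-learn stores a placeholder of `-2` in `tree_.threshold` and `tree_.feature` for leaf nodes, which have no split. Masking on `tree_.feature >= 0` rather than `threshold > -2` is a small variation on the question's code, not something the question used; it avoids dropping genuine thresholds when X can take values below -2.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: only the rows visible in the question.
df = pd.DataFrame({
    "X": [20, 22, 24, 27, 30, 40, 20],
    "y": [0, 0, 1, 0, 1, 1, 0],
})

# 1. Equal-width bins between the min and max of X.
equal_width = pd.cut(df["X"], bins=4)

# 2. Quantile-based bins (quartiles here); duplicates="drop" merges
#    repeated bin edges if many values are tied.
quantile = pd.qcut(df["X"], q=4, duplicates="drop")

# 3. Target-based bins: let an entropy tree choose the cut points.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(df[["X"]], df["y"])

# Leaf nodes carry the placeholder -2 in both arrays; internal (split)
# nodes have a feature index >= 0.
is_split = clf.tree_.feature >= 0
thresholds = np.unique(clf.tree_.threshold[is_split])  # sorted, de-duplicated

# Turn the thresholds into bin edges. pd.cut uses right-closed intervals
# by default, which matches the tree's "X <= threshold" rule.
edges = np.concatenate(([-np.inf], thresholds, [np.inf]))
tree_bins = pd.cut(df["X"], bins=edges)

print(thresholds)
print(tree_bins.value_counts().sort_index())
```

The number of `-2` entries in `clf.tree_.threshold` should equal the number of leaves, which is why there are so many of them in a deeper tree.
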
Yaron