I am using scikit-learn to create a decision tree, and it's working like a charm. I would like to achieve one more thing: to make the tree split on each attribute only once.
The reason is my rather unusual dataset: it is noisy, and I am genuinely interested in the noise as well. My class outcome is binary, say [+,-], and I have a bunch of numeric attributes, mostly in the range (0,1).
When scikit-learn builds the tree, it splits on the same attribute multiple times to make the tree "better". I understand that this makes the leaf nodes purer, but that is not what I want here.
What I did was define a single cutoff for every attribute by computing the information gain at different candidate cutoffs and choosing the one with the maximum gain. Evaluated with "leave-one-out" and "1/3-2/3" cross-validation, this gives better results than the original tree.
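Roughly, my cutoff search looks like the minimal sketch below (the entropy/gain helpers and function names are my own simplified code, not scikit-learn API; features are assumed to be 1-D numpy arrays and the labels encoded as 0/1):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a binary (0/1) label vector."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(x, y, cutoff):
    """Gain of splitting the feature values x at the given cutoff."""
    left, right = y[x <= cutoff], y[x > cutoff]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    h_split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - h_split

def best_cutoff(x, y):
    """Try midpoints between consecutive unique values, keep the best one."""
    values = np.unique(x)
    if len(values) < 2:
        return None, 0.0
    candidates = (values[:-1] + values[1:]) / 2.0
    gains = [information_gain(x, y, c) for c in candidates]
    best = int(np.argmax(gains))
    return candidates[best], gains[best]
```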
The problem is that when I try to automate this, I run into trouble near the lower and upper bounds, i.e. around 0 and 1: almost all elements fall on one side of such a cutoff, and I get a very high information gain because one of the two sets is pure, even though it contains only 1-2% of the full data.
All in all, I would like to make scikit-learn split on each attribute only once.
If that cannot be done, do you guys have any advice on how to generate those cutoffs in a nice way?