My fellow students and I are working on an educational machine learning project, and we are stuck with an overfitting problem, as we are quite inexperienced with data mining.
Our business case is retail banking: we aim to identify customer target groups for specific products, i.e. to recommend products to customers based on the products they have already bought, such as stock shares, funds, deposits, etc.
We received a data set with about 400 features and 150,000 records. We built our workflows in KNIME. Our workflow includes the following steps:
- We explored the data and defined the target variables
- We used a Missing Value Column Filter to eliminate all columns consisting mostly of missing values
- We also applied the Tree Ensemble workflow to reduce the number of dimensions
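For anyone who wants to see what we mean by these two filtering steps, here is a minimal sketch in Python/scikit-learn rather than KNIME (synthetic data, made-up column names, and a deliberately obvious target just for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 8)),
                  columns=[f"feat_{i}" for i in range(8)])
df["mostly_missing"] = np.nan          # a column that is almost entirely missing
y = (df["feat_0"] > 0).astype(int)     # toy target, driven by feat_0 on purpose

# Step 1: missing-value column filter — keep columns with < 50% missing values
keep = df.columns[df.isna().mean() < 0.5]
df = df[keep]

# Step 2: tree-ensemble-based dimensionality reduction — rank features by
# importance from a forest and keep only the top few
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(df, y)
importances = pd.Series(forest.feature_importances_, index=df.columns)
top = importances.nlargest(4).index.tolist()
print(top)
```

In KNIME we do the equivalent with the Missing Value Column Filter node and the Tree Ensemble nodes; the sketch above is only our understanding of what those steps do, not our actual workflow.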
All in all, we cleaned up our data and reduced it from 400 variables down to about 50. For modelling we use a simple decision tree, and this is where the problem appears: the tree always reports an accuracy of 100%, so we assume it is heavily overfitted.
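To make the symptom concrete, here is a small Python/scikit-learn sketch (synthetic data, hypothetical setup, not our KNIME workflow) of the sanity check we are describing: an unpruned decision tree typically memorizes the training set, but 100% accuracy even on a held-out set usually points to a "leaky" feature that encodes the target directly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 5))                                # 5 ordinary features
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Simulate target leakage: append a column that is just a copy of the label
X_leaky = np.column_stack([X, y])

results = {}
for name, features in [("clean", X), ("leaky", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.3, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    results[name] = (accuracy_score(y_tr, tree.predict(X_tr)),
                     accuracy_score(y_te, tree.predict(X_te)))
    print(name, "train acc: %.3f  test acc: %.3f" % results[name])
```

On the clean data the train accuracy is perfect but the test accuracy is not (ordinary overfitting); on the leaky data both are perfect, which looks a lot like what our tree is doing.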
Are we doing something wrong, or what should we focus on?
We hope the community can help us with some hints or tips.
Edit: Are there any sources, papers, etc. on how to apply cross-/up-selling in a data mining tool such as KNIME? We have already googled, but so far we've been unsuccessful.
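In case it helps clarify what we are after, here is a toy illustration of the cross-selling idea in pure pandas (made-up product names and purchase data; a co-occurrence count, not a substitute for proper literature on the topic):

```python
import pandas as pd

# Toy purchase matrix: rows = customers, columns = products (1 = owned).
purchases = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 0, 0, 1],
     [1, 1, 0, 1]],
    columns=["stocks", "funds", "deposits", "bonds"])

# Co-occurrence counts: how often each pair of products is owned together.
co = purchases.T.dot(purchases)

# For a customer who owns "stocks", recommend the product most often
# co-owned with it (excluding stocks itself).
rec = co["stocks"].drop("stocks").idxmax()
print(rec)  # → funds
```

We assume the literature would do this with association rules or collaborative filtering rather than raw co-occurrence counts, which is exactly the kind of reference we are looking for.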