
My fellow students and I are working on an educational machine learning project, and we are stuck on an overfitting problem, as we are quite inexperienced with data mining.

Our business case is in retail banking: we aim to identify customer target groups per product, i.e. to recommend specific products to customers based on the products they have already bought, such as stock shares, funds, deposits, etc.

We received a data set with about 400 features and 150,000 records. We build our workflows in KNIME. Our workflow includes the following steps:

  • We explored the data and defined the target variables
  • We used a Missing Value Column Filter to eliminate all columns that consist mostly of missing values
  • We also applied the Tree Ensemble workflow to reduce the dimensionality (a rough Python equivalent of these two steps is sketched below)
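For anyone wanting to reproduce these steps outside KNIME, here is a minimal Python/scikit-learn sketch of the missing-value filter and the tree-ensemble feature selection. The file name, the target column `bought_product`, and the 50% missing-value threshold are assumptions for illustration, not our actual settings:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical input: ~400 feature columns plus a binary target column.
df = pd.read_csv("bank_customers.csv")  # assumed file name

# Missing Value Column Filter equivalent: drop columns that are mostly
# missing (the 50% threshold is an assumption; the KNIME node is configurable).
df = df.loc[:, df.isna().mean() < 0.5]

# Tree Ensemble equivalent: rank features by importance and keep the top 50.
# (Assumes the remaining features are numeric; encode categoricals first.)
X = df.drop(columns=["bought_product"]).fillna(0)
y = df["bought_product"]
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
X_reduced = X[importances.nlargest(50).index]
```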

All in all, we cleaned up our data and reduced it from 400 variables down to about 50. For modelling we use a simple decision tree, and here the problem appears: the tree always reports an accuracy of 100%, so we assume it is highly overfitted.
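A quick way to see whether the 100% is ordinary overfitting or target leakage is to compare train and test accuracy on a held-out split. A minimal sketch, continuing the assumed variables from the sketch above:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, stratify=y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pure overfitting shows ~100% train accuracy but clearly lower test accuracy.
# 100% on *both* sets usually means a leaking feature that encodes the target.
print("train:", accuracy_score(y_train, tree.predict(X_train)))
print("test: ", accuracy_score(y_test, tree.predict(X_test)))
print(confusion_matrix(y_test, tree.predict(X_test)))
```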

Is there anything we are doing wrong, or what should we focus on?

We hope the community can help us with some hints or tips.

Edit: Are there any sources, papers, etc. on how to implement cross-selling/up-selling in a data mining tool such as KNIME? We have already googled it, but so far we've been unsuccessful.

ste92
  • Did you split into training / test set? Is there some way of limiting the depth of the tree? (Just top-of-head type questions to provoke thought!) – SteveR Feb 15 '18 at 22:18
  • Hey SteveR, yes, we split it into an 80/20 training and test set. OK, we will follow up on this point, thank you. – ste92 Feb 16 '18 at 07:29
  • I edited my question, do you have an idea? – ste92 Feb 16 '18 at 07:58
  • 1
    @Stelios what is your test accuracy? – janu777 Feb 16 '18 at 08:29
  • Have you inspected the actual tree and the confusion matrix? Have you tried cross-validation? KNIME is a good tool but you might want to try something like Weka for an initial pass at the data as (for decision trees at least) it should give you something reasonable with default settings. If what you get in Weka looks right, then try and reproduce it in KNIME. – nekomatic Feb 16 '18 at 11:53
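For reference, the cross-validation suggested in the last comment looks like this in scikit-learn (KNIME offers the same idea via its X-Partitioner / X-Aggregator loop nodes). A minimal sketch, reusing the assumed variables from the question:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold cross-validation; consistently perfect scores across all folds
# point to a leaking feature rather than ordinary overfitting.
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X_reduced, y, cv=10)
print(scores.mean(), scores.std())
```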

2 Answers


One of the problems with decision trees is that they are prone to overfitting. You can apply pruning, which reduces the complexity of the model and thereby improves predictive accuracy by reducing overfitting. Also try tuning the minimum samples per leaf and the maximum tree depth.
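In scikit-learn terms this looks roughly as follows; the parameter values are illustrative starting points to tune, not recommendations, and the train/test variables are the assumed ones from the question. KNIME's Decision Tree Learner node exposes similar options (pruning method, minimum records per node):

```python
from sklearn.tree import DecisionTreeClassifier

# Constrained tree: limited depth, a minimum leaf size, and
# cost-complexity pruning (ccp_alpha) to trim uninformative branches.
tree = DecisionTreeClassifier(max_depth=5,
                              min_samples_leaf=50,
                              ccp_alpha=1e-3,
                              random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```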

Srikant

Agree with the previous answer: the main disadvantage of decision trees is their tendency to overfit.

  1. Try to make the decision tree simpler (at least reduce its depth).
  2. Use ensemble methods (Random Forests or even XGBoost); they are the natural next step beyond single decision trees (see the sketch below).
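A minimal sketch of both suggestions; hyperparameters are illustrative, XGBoost requires the separate xgboost package, and the train/test variables are the assumed ones from the question. In KNIME, the Random Forest Learner and Tree Ensemble Learner nodes play the same role:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier  # external package: pip install xgboost

# Random forest: many decorrelated trees averaged together,
# which overfits far less than a single deep tree.
rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
rf.fit(X_train, y_train)
print("random forest:", accuracy_score(y_test, rf.predict(X_test)))

# Gradient boosting: trees built sequentially on the previous trees' errors.
xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
xgb.fit(X_train, y_train)
print("xgboost:", accuracy_score(y_test, xgb.predict(X_test)))
```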
avchauzov