My fellow students and I are working on an educational machine learning project, and we are stuck with an overfitting problem, as we are quite inexperienced with data mining.
Our business case is retail banking: we aim to identify customer target groups for specific products, i.e. to recommend products to customers based on the products they have already bought, such as stock shares, funds, deposits, etc.
We received a data set with about 400 features and 150,000 records. We built our workflows in KNIME. Our workflow includes the following steps:
- We explored the data and defined the target variables
- We used a Missing Value Column Filter to eliminate all columns consisting mostly of missing values
- We also applied the Tree Ensemble workflow to reduce the number of dimensions
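For anyone who wants to see what we mean by these two filtering steps, here is a minimal sketch in Python/scikit-learn rather than KNIME (synthetic data, made-up column names, and a deliberately obvious target just for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 8)),
                  columns=[f"feat_{i}" for i in range(8)])
df["mostly_missing"] = np.nan          # a column that is almost entirely missing
y = (df["feat_0"] > 0).astype(int)     # toy target, driven by feat_0 on purpose

# Step 1: missing-value column filter — keep columns with < 50% missing values
keep = df.columns[df.isna().mean() < 0.5]
df = df[keep]

# Step 2: tree-ensemble-based dimensionality reduction — rank features by
# importance from a forest and keep only the top few
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(df, y)
importances = pd.Series(forest.feature_importances_, index=df.columns)
top = importances.nlargest(4).index.tolist()
print(top)
```

In KNIME we do the equivalent with the Missing Value Column Filter node and the Tree Ensemble nodes; the sketch above is only our understanding of what those steps do, not our actual workflow.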
All in all, we cleaned up our data and reduced it from 400 variables down to about 50. For modelling we use a simple decision tree, and this is where the problem appears: the tree always reports an accuracy of 100%, so we assume it is heavily overfitted.
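To make the symptom concrete, here is a small Python/scikit-learn sketch (synthetic data, hypothetical setup, not our KNIME workflow) of the sanity check we are describing: an unpruned decision tree typically memorizes the training set, but 100% accuracy even on a held-out set usually points to a "leaky" feature that encodes the target directly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 5))                                # 5 ordinary features
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Simulate target leakage: append a column that is just a copy of the label
X_leaky = np.column_stack([X, y])

results = {}
for name, features in [("clean", X), ("leaky", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.3, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    results[name] = (accuracy_score(y_tr, tree.predict(X_tr)),
                     accuracy_score(y_te, tree.predict(X_te)))
    print(name, "train acc: %.3f  test acc: %.3f" % results[name])
```

On the clean data the train accuracy is perfect but the test accuracy is not (ordinary overfitting); on the leaky data both are perfect, which looks a lot like what our tree is doing.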
Are we doing something wrong, or what should we focus on?
We hope the community can help us with some hints or tips.
Edit: Are there any sources, papers, etc. on how to apply cross-/up-selling in a data mining tool such as KNIME? We have already googled, but so far we've been unsuccessful.
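In case it helps clarify what we are after, here is a toy illustration of the cross-selling idea in pure pandas (made-up product names and purchase data; a co-occurrence count, not a substitute for proper literature on the topic):

```python
import pandas as pd

# Toy purchase matrix: rows = customers, columns = products (1 = owned).
purchases = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 0, 0, 1],
     [1, 1, 0, 1]],
    columns=["stocks", "funds", "deposits", "bonds"])

# Co-occurrence counts: how often each pair of products is owned together.
co = purchases.T.dot(purchases)

# For a customer who owns "stocks", recommend the product most often
# co-owned with it (excluding stocks itself).
rec = co["stocks"].drop("stocks").idxmax()
print(rec)  # → funds
```

We assume the literature would do this with association rules or collaborative filtering rather than raw co-occurrence counts, which is exactly the kind of reference we are looking for.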