
The blog post https://www.kdnuggets.com/2021/01/sparse-features-machine-learning-models.html mentions that decision trees overfit the data when we have sparse features.

To understand the intuition behind this, I tried fitting a decision tree on a dataset with a single feature column, feature, and an output variable y containing binary class labels (i.e., 0 and 1). I computed the average validation accuracy for two cases:

  1. When the dataset is sparse (i.e., most of the values of column feature are 0).
  2. When the dataset is non-sparse.

Snippet:

    import random
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
    
    def accuracy_for_given_percentage_of_sparse_data(percentage_of_sparse_data):
        n = 1000  # size of the dataframe or number of datapoints
    
        number_of_sparse_datapoints = int(percentage_of_sparse_data * n)
        number_of_non_sparse_datapoints = n - number_of_sparse_datapoints
    
        average_accuracy = 0
        number_of_runs = 1000
    
        for run in range(0, number_of_runs):
            # Setting 0s as values of feature
            feature = [0 for i in range(0, number_of_sparse_datapoints)]
    
            # Setting some random non zero values for feature
            for i in range(0, number_of_non_sparse_datapoints):
                feature.append(random.randint(1, 50))
    
            # Generate the n random binary class labels
            y = [random.randint(0, 1) for i in range(0, n)]
    
            d = {
                'feature': feature,
                'y': y
            }
    
            df = pd.DataFrame(data=d)
            clf = DecisionTreeClassifier(criterion="gini", splitter='best', class_weight='balanced')
    
            feature_cols = ['feature']
            X = df[feature_cols]
            y = df.y
    
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    
            # Train Decision Tree Classifier
            clf = clf.fit(X_train, y_train)
    
            average_accuracy += accuracy_score(y_test, clf.predict(X_test))
    
        return average_accuracy / number_of_runs
    
    
    print("Average accuracy for sparse data: ", accuracy_for_given_percentage_of_sparse_data(0.9))
    print("Average accuracy for non-sparse data: ", accuracy_for_given_percentage_of_sparse_data(0.1))

Output I received:

    Average accuracy for sparse data:  0.5015030303030307
    Average accuracy for non-sparse data:  0.5009696969696967

My questions:

  1. I cannot see much difference between the two accuracies. Am I missing something or doing something wrong here?
  2. What is the mathematical intuition behind the fact that decision trees overfit the data when we have sparse features?

1 Answer


Regarding Q1: Of course you are getting ~50% accuracy in both cases. Your labels y are generated independently of feature, so there is no pattern to learn: the best any classifier can do is predict the majority class, which yields roughly 50% accuracy no matter how sparse the feature is.
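
A quick way to convince yourself of this (my own sketch, not part of the original answer): on labels drawn independently of the feature, a trivial majority-class baseline and a full decision tree both land at roughly the same ~50% test accuracy, because 50% is the best achievable.

    import random
    from sklearn.dummy import DummyClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    random.seed(0)
    n = 1000

    # Labels are drawn independently of the feature: there is nothing to learn
    X = [[random.randint(1, 50)] for _ in range(n)]
    y = [random.randint(0, 1) for _ in range(n)]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # Both models end up at roughly 50% test accuracy
    for model in (DummyClassifier(strategy="most_frequent"),
                  DecisionTreeClassifier(random_state=42)):
        model.fit(X_train, y_train)
        print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))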

Regarding Q2: Decision trees tend to overfit, especially on sparse data, because they can create many very specific rules. Each such rule applies to only a handful of training examples, so the tree effectively memorizes the training set. When you then try to generalize to new data, the rules are so specific to the training data that they perform poorly.
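
To see the overfitting itself, compare training accuracy with test accuracy rather than looking at test accuracy alone. A minimal sketch using the same setup as the question (90% zeros, random labels): the tree carves the rare non-zero values into tiny, near-pure leaves, so training accuracy rises above chance while test accuracy stays near 50%. The gap between the two is the overfitting.

    import random
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    random.seed(0)
    n = 1000

    # 90% zeros, as in the question, with labels that carry no signal
    feature = [0] * 900 + [random.randint(1, 50) for _ in range(100)]
    y = [random.randint(0, 1) for _ in range(n)]
    X = [[v] for v in feature]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

    # Training accuracy exceeds test accuracy: the tree has memorized the
    # rare non-zero values rather than learned a general rule
    print("Train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
    print("Test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))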

  • Hey @Proxygonn, can you please suggest a snippet with an apt dataset where I will be able to visualise a significant difference between the accuracies? – Deepak Tatyaji Ahire Mar 11 '23 at 15:00
  • @DeepakTatyajiAhire Basically what you have, but use sklearn's [make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) method to generate your data instead of generating random data – Proxygonn Mar 11 '23 at 15:04
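
A sketch of what the comment above suggests (the sparsification step, zeroing out entries at random, is my own assumption, not part of the comment): generate data that actually contains a learnable pattern with make_classification, then destroy most of that pattern by zeroing ~90% of the entries, and compare the resulting test accuracies.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Data with a real, learnable pattern
    X, y = make_classification(n_samples=1000, n_features=5, n_informative=5,
                               n_redundant=0, random_state=42)

    def test_accuracy(X, y):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
        clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
        return accuracy_score(y_test, clf.predict(X_test))

    # Zero out ~90% of the entries, wiping out most of the signal
    rng = np.random.default_rng(42)
    X_sparse = X * (rng.random(X.shape) > 0.9)

    print("Dense  test accuracy:", test_accuracy(X, y))
    print("Sparse test accuracy:", test_accuracy(X_sparse, y))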