The blog post https://www.kdnuggets.com/2021/01/sparse-features-machine-learning-models.html mentions that decision trees overfit the data when the features are sparse.
To understand the intuition behind this, I tried fitting a decision tree on a dataset with one feature column, feature, and one output variable, y, containing binary class labels (i.e., 0 and 1). I computed the average validation accuracy for two cases:
- When the dataset is sparse (i.e., most of the values of the column feature are 0).
- When the dataset is non-sparse.
Snippet:
import random
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def accuracy_for_given_percentage_of_sparse_data(percentage_of_sparse_data):
    n = 1000  # size of the dataframe, i.e. number of datapoints
    number_of_sparse_datapoints = int(percentage_of_sparse_data * n)
    number_of_non_sparse_datapoints = n - number_of_sparse_datapoints
    average_accuracy = 0
    number_of_runs = 1000
    for run in range(number_of_runs):
        # Set 0 as the value of feature for the sparse datapoints
        feature = [0 for i in range(number_of_sparse_datapoints)]
        # Append random non-zero values of feature for the rest
        for i in range(number_of_non_sparse_datapoints):
            feature.append(random.randint(1, 50))
        # Generate n random binary class labels
        y = [random.randint(0, 1) for i in range(n)]
        d = {
            'feature': feature,
            'y': y
        }
        df = pd.DataFrame(data=d)
        clf = DecisionTreeClassifier(criterion="gini", splitter='best', class_weight='balanced')
        feature_cols = ['feature']
        X = df[feature_cols]
        y = df.y
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
        # Train the decision tree classifier
        clf = clf.fit(X_train, y_train)
        average_accuracy += accuracy_score(y_test, clf.predict(X_test))
    return average_accuracy / number_of_runs

print("Average accuracy for sparse data: ", accuracy_for_given_percentage_of_sparse_data(0.9))
print("Average accuracy for non-sparse data: ", accuracy_for_given_percentage_of_sparse_data(0.1))
Output I received:
Average accuracy for sparse data: 0.5015030303030307
Average accuracy for non-sparse data: 0.5009696969696967
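Since the labels y are drawn uniformly at random, test accuracy should hover around 0.5 regardless of sparsity, so it may not reveal overfitting on its own. A sketch of an alternative diagnostic (my own assumption, not something the blog prescribes; train_test_gap is a hypothetical helper) is to look at the gap between training and test accuracy, which is where overfitting would show up:

```python
# Sketch (assumption): measure overfitting as train accuracy minus test accuracy
# on the same random-label setup, rather than looking at test accuracy alone.
import random
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_test_gap(percentage_of_sparse_data, n=1000):
    number_of_sparse = int(percentage_of_sparse_data * n)
    # Sparse datapoints are 0; the rest get random non-zero values
    feature = [0] * number_of_sparse
    feature += [random.randint(1, 50) for _ in range(n - number_of_sparse)]
    # Random binary labels, so the true signal is zero
    y = [random.randint(0, 1) for _ in range(n)]
    df = pd.DataFrame({'feature': feature, 'y': y})
    X_train, X_test, y_train, y_test = train_test_split(
        df[['feature']], df.y, test_size=0.33, random_state=42)
    clf = DecisionTreeClassifier(criterion="gini").fit(X_train, y_train)
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    # A large positive gap indicates memorization of noise, i.e. overfitting
    return train_acc - test_acc

print("Train-test gap (sparse):", train_test_gap(0.9))
print("Train-test gap (non-sparse):", train_test_gap(0.1))
```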
My questions:
- I cannot see much difference between the two accuracies. Am I missing something or doing something wrong here?
- What is the mathematical intuition behind the claim that the "decision tree overfits the data in the case when we have sparse features"?