Tree-based algorithms behave differently with duplicated features

Question

I don't understand why I get three different behaviors depending on the classifier I use, even though I would expect them to behave alike.

Here is the code that reproduces the behavior:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier 
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np

#load data

wine = datasets.load_wine()
X = wine.data
y = wine.target

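# some helper functions

def repeat_feature(X,which=1,times=1):
    # append `times` copies of the first `which` columns of X to the right of X
    return np.hstack([X] + [X[:, :which]]*times)

def do_the_job(X,y,clf):
    # mean accuracy over a 5-fold cross-validation
    return np.mean(cross_validate(clf, X, y,cv=5)['test_score'])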

# define the classifiers

clf1=DecisionTreeClassifier(max_depth=25,random_state=42)
clf2=RandomForestClassifier(n_estimators=5,random_state=42)
clf3=LGBMClassifier(n_estimators=5,random_state=42)


# append 1 to 49 extra copies of the first feature and test the classifiers

n_copies=range(1,50)
clf1_result=[]
clf2_result=[]
clf3_result=[]

for i in n_copies:
    my_x=repeat_feature(X,times=i)
    clf1_result.append(do_the_job(my_x,y,clf1))
    clf2_result.append(do_the_job(my_x,y,clf2))
    clf3_result.append(do_the_job(my_x,y,clf3))

# plot the mean of the cv-scores for each classifier

plt.figure(figsize=(12,7))
plt.plot(n_copies,clf1_result,label='tree')
plt.plot(n_copies,clf2_result,label='forest')
plt.plot(n_copies,clf3_result,label='boost')
plt.xlabel('number of extra copies of the first feature')
plt.ylabel('mean 5-fold CV accuracy')
plt.legend()
plt.show()

The result of the previous script is the following graph:

What I wanted to verify is that adding redundant information (such as a repeated feature) decreases the score. That is what happens for the random forest, as expected.

The question is why does this not happen with the other two classifiers instead? Why do their scores remain stable?

Am I missing something from the theoretical point of view?

Thanks, all.

score 2 · Accepted Answer · answered Jul 21 '22 at 02:11

When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier) or a LightGBM model using its default behavior (lightgbm.LGBMClassifier), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).

Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
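To make this concrete, here is a minimal pure-Python sketch of an exhaustive split search. It is not the actual scikit-learn or LightGBM implementation: the helper names `gini` and `best_gain`, the use of Gini impurity, and the four-row toy dataset are all assumptions made for illustration. A copied column offers exactly the same candidate splits as the original, so the best achievable gain does not change.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())


def best_gain(rows, labels):
    """Best impurity reduction over every feature and every threshold."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            if not left or not right:
                continue  # not a real split
            gain = (parent
                    - len(left) / n * gini(left)
                    - len(right) / n * gini(right))
            best = max(best, gain)
    return best


# Hypothetical toy data: two features, two classes.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
labels = [0, 0, 1, 1]

# Append three extra copies of the first feature to every row.
rows_dup = [r + [r[0]] * 3 for r in rows]

print(best_gain(rows, labels), best_gain(rows_dup, labels))
```

Both calls search over the same set of distinct splits, so they print the same value. Ties between identical columns can still be broken differently, which changes which column index is used but not the resulting partition.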

For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split (in scikit-learn this is controlled by max_features, which for RandomForestClassifier defaults to the square root of the number of features). The random forest learns how to explain the training data by ensembling together multiple slightly different models, and this can be effective because the different models explain different characteristics of the target. As you add copies of one feature, those copies take up a growing share of each random subset and crowd out the other features. If you hold the number of trees and the number of leaves per tree constant, this reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.
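The crowding-out effect can be estimated with a back-of-the-envelope calculation. The sketch below assumes the setup from the question (13 original wine features, with extra copies of the first one appended) and assumes each split draws `max(1, floor(sqrt(n_columns)))` columns uniformly without replacement, which mirrors scikit-learn's `max_features="sqrt"` setting. The function names are hypothetical, and the sketch ignores the fact that scikit-learn keeps drawing columns when none of the candidates yields a valid split.

```python
from math import comb, isqrt

N_ORIGINAL = 13  # number of columns returned by load_wine()


def candidates_per_split(n_columns):
    """Columns drawn per split under the assumed sqrt rule."""
    return max(1, isqrt(n_columns))


def expected_other_features(times):
    """Expected number of the 12 non-duplicated features among the candidates."""
    p = N_ORIGINAL + times
    m = candidates_per_split(p)
    return (N_ORIGINAL - 1) * m / p


def prob_duplicated_feature_present(times):
    """Probability that at least one copy of the first feature is a candidate."""
    p = N_ORIGINAL + times
    m = candidates_per_split(p)
    copies = times + 1  # the original column plus the appended copies
    return 1 - comb(p - copies, m) / comb(p, m)


for times in (0, 1, 10, 49):
    print(times,
          round(expected_other_features(times), 2),
          round(prob_duplicated_feature_present(times), 3))
```

Under these assumptions, a split with no duplicates draws 3 of 13 columns, while a split with 49 extra copies draws 7 of 62 columns, 50 of which carry the same information. Passing `max_features=None` to `RandomForestClassifier` makes every split consider all columns, which removes this dilution, although the trees still differ through bootstrap sampling.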

Do you happen to know why the LGBM has literally-constant score while the DT does not? — Ben Reiniger, Jul 21 '22 at 13:35
`sklearn.datasets.load_wine()` returns a dataset with only 178 observations, which might mean that there are very few possible splits based on LightGBM's default settings. Training a LightGBM on such a small dataset usually requires overriding those defaults. See [this answer](https://stackoverflow.com/a/66728185/3986677). — James Lamb, Jul 21 '22 at 14:28
I had thought the decision tree should also be literally-constant, but I think changing the shape of the data by adding duplicate columns changes how the columns are shuffled; the random state can't keep that constant with additional columns. — Ben Reiniger, Jul 27 '22 at 15:48
