I don't understand why I have three different behaviors depending on the classifier I use, even though they should go hand in hand.
This is the code in order to go deeply in the question:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np
#load data
wine = datasets.load_wine()
X = wine.data
y = wine.target
# some helper functions
def repeat_feature(X,which=1,times=1):
return np.hstack([X,np.hstack([X[:, :which]]*times)])
def do_the_job(X,y,clf):
return np.mean(cross_validate(clf, X, y,cv=5)['test_score'])
# define the classifiers
clf1=DecisionTreeClassifier(max_depth=25,random_state=42)
clf2=RandomForestClassifier(n_estimators=5,random_state=42)
clf3=LGBMClassifier(n_estimators=5,random_state=42)
# repeat up to 50 times the same feature and test the classifiers
clf1_result=[]
clf2_result=[]
clf3_result=[]
for i in range(1,50):
my_x=repeat_feature(X,times=i)
clf1_result.append(do_the_job(my_x,y,clf1))
clf2_result.append(do_the_job(my_x,y,clf2))
clf3_result.append(do_the_job(my_x,y,clf3))
# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12,7))
plt.plot(clf1_result,label='tree')
plt.plot(clf2_result,label='forest')
plt.plot(clf3_result,label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that by adding the same information (like a repeated feature) I would get a decrease in the score (which happens as expected for random forest).
The question is why does this not happen with the other two classifiers instead? Why do their scores remain stable?
Am I missing something from the theoretical point of view?
Ty all