
I am trying to get the importance weight of every feature in my dataframe. I use this code from the scikit-learn documentation:

import numpy as np
import pandas as pd

names = ['Class label', 'Alcohol',
         'Malic acid', 'Ash',
         'Alcalinity of ash', 'Magnesium',
         'Total phenols', 'Flavanoids',
         'Nonflavanoid phenols',
         'Proanthocyanins',
         'Color intensity', 'Hue',
         'OD280/OD315 of diluted wines',
         'Proline']
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, names=names)



from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=10000,
 random_state=0,
 n_jobs=-1)
forest.fit(X_train, y_train)

feat_labels = df_wine.columns[1:]
importances = forest.feature_importances_ 
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,feat_labels[f], importances[indices[f]]))

Although I understand the np.argsort method, I still don't understand this for loop. Why do we use "indices" to index the "importances" array? And why can't we simply use this code instead:

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,feat_labels[f], importances[f]))

Output when using "importances[indices[f]]" (first 5 rows):

 1) Alcohol                        0.182483
 2) Malic acid                     0.158610
 3) Ash                            0.150948
 4) Alcalinity of ash              0.131987
 5) Magnesium                      0.106589

Output when using "importances[f]" (first 5 rows):

 1) Alcohol                        0.106589
 2) Malic acid                     0.025400
 3) Ash                            0.013916
 4) Alcalinity of ash              0.032033
 5) Magnesium                      0.022078
mokebe

1 Answer


This is not what the docs contain. Look closely: the docs say

# FROM DOCS
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

which is correct, and not

# FROM YOUR QUESTION
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,feat_labels[f], importances[indices[f]]))

which is wrong. If you want to use feat_labels you should do

# CORRECT SOLUTION
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,feat_labels[indices[f]], importances[indices[f]]))

Their approach is used because they want to iterate in decreasing order of feature importance; without "indices" the loop would follow the original column order of the features instead. Both are fine. The only incorrect one is the first one you proposed, which mixes the two approaches: it pairs labels in column order with importances in sorted order, and so assigns importances to the wrong features.
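To see the mismatch on a small scale, here is a minimal sketch that uses only the five unsorted importances printed in your question rather than a fitted forest. The variable names "correct" and "mixed" are just illustrative, and the arrays are hard-coded as an assumption so the snippet runs on its own:

```python
import numpy as np

# First five labels and their unsorted importances, copied from the question's output.
feat_labels = np.array(['Alcohol', 'Malic acid', 'Ash',
                        'Alcalinity of ash', 'Magnesium'])
importances = np.array([0.106589, 0.025400, 0.013916, 0.032033, 0.022078])

# Positions of the features, ordered from most to least important.
indices = np.argsort(importances)[::-1]

# Correct: the label and the importance are looked up with the same index.
correct = [(str(feat_labels[i]), float(importances[i])) for i in indices]

# Mixed (the bug): labels follow column order, importances follow sorted order.
mixed = [(str(feat_labels[f]), float(importances[indices[f]]))
         for f in range(len(indices))]

for rank, (name, value) in enumerate(correct, start=1):
    print("%2d) %-*s %f" % (rank, 30, name, value))
```

In "correct" the second entry is 'Alcalinity of ash' with 0.032033, while in "mixed" the same value is attached to 'Malic acid', whose real importance is 0.025400.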

lejlot