
I have the following code that runs a random forest regression to rank feature importance. I would like to add cross-validation (k-fold). Below is my code for the regression, which gives me the features and their ranks. I have tried adapting some code I found online to add cross-validation, but so far without success. Any ideas? I am not splitting the data into train/test sets.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv(dataset_path + file_name)

X = df.drop(['target'], axis=1)
y = df['target']

clf = RandomForestRegressor(random_state=42, n_jobs=-1)
# Train model
model = clf.fit(X, y)

feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(8,6))
nic.o
    Have a look at `Permutation feature importance` [here](https://scikit-learn.org/stable/modules/permutation_importance.html). – seralouk Dec 30 '22 at 21:16
  • ... especially point 4.2.2. Random forest (MDI-) feature importance might not be what you looking for, but as @seralouk points out permutation feature importance or SHAP – DataJanitor Jan 12 '23 at 12:24

1 Answer


You can consider using `KFold` with `.split()`. This will randomly split the data into k folds (as cross-validation does) and return the index values of the training and test sets for each fold.
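To see what `.split()` yields, here is a small sketch with toy data (the array is an assumption purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, so each of 5 folds holds out 2 samples
X_demo = np.arange(20).reshape(10, 2)

cv = KFold(n_splits=5, shuffle=True, random_state=10)
for train_ix, test_ix in cv.split(X_demo):
    # train_ix / test_ix are integer index arrays into the rows of X_demo
    print(len(train_ix), len(test_ix))  # 8 2 on every fold
```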

Your code will look like this:


importances_per_fold = []

CV = KFold(n_splits=5, shuffle=True, random_state=10)

ix_training, ix_test = [], []
# Loop through each fold and append the training & test indices to the empty lists above
for fold in CV.split(df):
    ix_training.append(fold[0]), ix_test.append(fold[1])

for i, (train_outer_ix, test_outer_ix) in enumerate(zip(ix_training, ix_test)): 
    
    X_train, X_test = X.iloc[train_outer_ix, :], X.iloc[test_outer_ix, :]
    y_train, y_test = y.iloc[train_outer_ix], y.iloc[test_outer_ix]

    clf = RandomForestRegressor(random_state =  42, n_jobs=-1)
    # Train model
    model = clf.fit(X_train, y_train)

    importances_per_fold.append(model.feature_importances_)

# Get mean feature importance across all folds
    
av_importances = np.mean(importances_per_fold, axis = 0)

feat_importances = pd.DataFrame(av_importances, index = X.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(8,6))

Most of this code was adapted from this implementation of SHAP values with cross-validation. Averaging the importances across folds is more robust than computing them from a single fit on all the data, though note it still relies on sklearn's built-in impurity-based importances; for a more reliable measure, see the permutation importance suggested in the comments.
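As the comments suggest, permutation importance can be computed on each held-out fold and averaged the same way. A sketch of that approach, using a synthetic dataset as a stand-in for your `X` and `y` (the data, column names, and hyperparameters here are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import KFold

# Synthetic stand-in for your dataframe; substitute your real X and y
X_arr, y_arr = make_regression(n_samples=200, n_features=5, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr)

cv = KFold(n_splits=5, shuffle=True, random_state=10)
perm_importances = []

for train_ix, test_ix in cv.split(X):
    model = RandomForestRegressor(random_state=42, n_jobs=-1).fit(
        X.iloc[train_ix], y.iloc[train_ix]
    )
    # Score the permutations on the held-out fold, not the training data
    result = permutation_importance(
        model, X.iloc[test_ix], y.iloc[test_ix], n_repeats=10, random_state=42
    )
    perm_importances.append(result.importances_mean)

# Average the per-fold permutation importances
av_perm = pd.Series(np.mean(perm_importances, axis=0), index=X.columns)
print(av_perm.sort_values(ascending=False))
```

Scoring on the held-out fold avoids the bias that impurity-based importances have toward high-cardinality features and features the model overfits on.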

Derek