I am using TPOT and Auto-Sklearn on a custom dataset, and I want to evaluate every pipeline they create by its accuracy and its feature importance. I have iteratively fitted a classifier and stored all the pipelines, along with their accuracies, in a CSV file. Now I want to read the pipelines back from the CSV one by one, fit each of them, and store their feature importances in the same file. The issue: I retrieve each pipeline as a string, but when I pass it to eval() and fit the result, all the classes it references must already be imported. I don't know how to import them dynamically, since the CSV contains a wide variety of models and preprocessing functions from sklearn and auto-sklearn. How can I fit each pipeline to get its feature importance?
Here is a snapshot of my csv that holds TPOT pipelines.
Here is a snapshot of my csv that holds auto-sklearn pipelines.
Here is the code snippet.
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance

file_df = pd.read_csv('/content/drive/MyDrive/Stefan/sunscreen_data_complex_old/TPOT_results.csv')
feature = []
feat_name = []
for pipeline_name in file_df['Pipeline']:
    # eval() fails here unless every class in the string is already imported
    pipe = eval(pipeline_name)
    pipe.fit(X_train, y_train)
    if hasattr(pipe, 'feature_importances_'):
        feature.append(max(pipe.feature_importances_))
        feat_name.append(X_train.columns[np.argmax(pipe.feature_importances_)])
    elif hasattr(pipe, 'coef_'):
        # coef_ is 2-D for classifiers, so flatten before taking the max
        coefs = pipe.coef_.ravel()
        feature.append(max(coefs))
        feat_name.append(X_train.columns[np.argmax(coefs)])
    else:
        result = permutation_importance(pipe, X_test, y_test, n_repeats=10, random_state=0)
        feature.append(max(result.importances_mean))
        feat_name.append(X_train.columns[np.argmax(result.importances_mean)])
file_df['Feature Importance'] = feature
file_df['Feature Name'] = feat_name
Just one last thing: if someone knows how to get feature importance while TPOT or Auto-Sklearn is searching for the optimal pipeline, please do guide me. I have tried a lot, but they only report the importance of the final optimal pipeline rather than of every pipeline they evaluated.