I am using xgboost to make predictions. We do some pre-processing and hyper-parameter tuning before fitting the model. While performing model diagnostics, we'd like to plot feature importances with the feature names.
Here are the steps we've taken.
# imports used in the steps below
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

# split df into train and test
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:21], df.iloc[:,-1], test_size=0.2)
X_train.shape
(1671, 21)
#Encoding of categorical variables
cat_vars = ['cat1','cat2']
cat_transform = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_vars)], remainder='passthrough')
encoder = cat_transform.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
X_train.shape
(1671, 420)
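I suspect the expanded names can be recovered from the fitted ColumnTransformer itself. A minimal sketch of what I have in mind, assuming scikit-learn >= 1.0 where ColumnTransformer exposes get_feature_names_out (older releases had get_feature_names instead); I have not verified this on our setup:
# recover the post-encoding feature names from the fitted ColumnTransformer
# (assumption: get_feature_names_out returns 420 names, one per column
#  of the transformed X_train, in matching order)
feature_names = encoder.get_feature_names_out()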
# Define xgb object
model = XGBRegressor()
# Tune hyper-parameters
# params is our hyper-parameter search space, defined elsewhere (a hypothetical example follows below)
r = RandomizedSearchCV(model, param_distributions=params, n_iter=200, cv=3, verbose=1, n_jobs=1)
# Fit model
r.fit(X_train, y_train)
xgb = r.best_estimator_
xgb
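For reference, params was defined earlier as our search space; a hypothetical example of its shape (our actual values differ):
# hypothetical search space -- our real params dict is different
params = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.85, 1.0],
}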
# Plot feature importance
plt.barh(X_train.feature_names?, xgb.feature_importances_)  # fails: X_train has no feature names after the transform
After encoding, X_train no longer carries the original column names, and we cannot reuse the names from the original dataframe because of the shape mismatch (21 original columns vs. 420 encoded ones).
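For reference, this is roughly the plot I am trying to produce, assuming the feature_names array from the sketch above lines up column-for-column with the transformed X_train:
import numpy as np

# pair each encoded name with its importance and plot the 20 largest
importances = xgb.feature_importances_
order = np.argsort(importances)[-20:]  # indices of the 20 largest importances
plt.barh(np.array(feature_names)[order], importances[order])
plt.xlabel('importance')
plt.tight_layout()
plt.show()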