
I am using xgboost to make some predictions. We do some pre-processing and hyper-parameter tuning before fitting the model. While performing model diagnostics, we'd like to plot feature importances with their feature names.

Here are the steps we've taken.

# split df into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:21], df.iloc[:, -1], test_size=0.2)

X_train.shape
(1671, 21)

# Encoding of categorical variables
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_vars = ['cat1', 'cat2']
cat_transform = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_vars)], remainder='passthrough')

encoder = cat_transform.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

X_train.shape
(1671, 420)
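
For context, the jump from 21 to 420 columns is the one-hot expansion: every level of cat1 and cat2 becomes its own column, and the 19 passthrough columns are appended after the encoded ones. A toy illustration (hypothetical data, not our actual df):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'cat1': ['a', 'b', 'a'], 'num1': [1.0, 2.0, 3.0]})
ct = ColumnTransformer([('cat', OneHotEncoder(), ['cat1'])], remainder='passthrough')
print(ct.fit_transform(toy))  # 3 columns: cat1_a, cat1_b, then passthrough num1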

# Define xgb object
from xgboost import XGBRegressor

model = XGBRegressor()
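
The params dict used below wasn't shown above; for completeness, a placeholder search space along these lines (illustrative values only, not the distributions we actually used):

from scipy.stats import randint, uniform

params = {
    'n_estimators': randint(100, 1000),   # number of boosting rounds, sampled from [100, 1000)
    'max_depth': randint(3, 10),          # tree depth, sampled from [3, 10)
    'learning_rate': uniform(0.01, 0.3),  # eta, sampled from [0.01, 0.31]
    'subsample': uniform(0.7, 0.3),       # row subsampling rate, sampled from [0.7, 1.0]
}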

# Tune hyper-parameters
from sklearn.model_selection import RandomizedSearchCV

r = RandomizedSearchCV(model, param_distributions=params, n_iter=200, cv=3, verbose=1, n_jobs=1)

# Fit model
r.fit(X_train, y_train)

xgb = r.best_estimator_
xgb

# Plot feature importance
import matplotlib.pyplot as plt

plt.barh(feature_names, xgb.feature_importances_)  # feature_names = ? -- this is the part we can't fill in

The transformed X_train is just the encoded matrix, so it carries no usable feature names, and we can't simply reuse the column names from the original dataframe because of the shape mismatch (21 columns before encoding vs. 420 after).
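
Conceptually, what we're trying to write is something like the sketch below: get the 420 expanded names from the fitted ColumnTransformer and line them up with xgb.feature_importances_. This assumes get_feature_names_out is available on ColumnTransformer (scikit-learn >= 1.0; older versions expose get_feature_names instead):

import numpy as np
import matplotlib.pyplot as plt

# names of the 420 encoded columns, in the same order as X_train's columns
feature_names = encoder.get_feature_names_out()
importances = xgb.feature_importances_

# plot only the 20 largest importances so the chart stays readable
order = np.argsort(importances)[-20:]
plt.barh(np.asarray(feature_names)[order], importances[order])
plt.xlabel('feature importance')
plt.show()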

asked by kms

Comments:
  • Does this answer your question? https://stackoverflow.com/a/63450222/12957340 – jared_mamrot Dec 02 '21 at 23:42
  • @jared_mamrot It doesn't, because that answer involves no encoding step. I am running into this issue precisely because the encoding changes the shape of the dataframe. – kms Dec 03 '21 at 03:05

0 Answers