Questions:
- This first question is probably naive, but I will ask anyway: are pruning and early stopping the same thing in the example below, or are they two separate options controlling two separate processes?
- My target is imbalanced, so how can I use a custom evaluation metric here, such as balanced accuracy, instead of 'binary_logloss'?
- When I get the optimal parameters, 'n_estimators' will still equal 999999. Using an "infinite" number of estimators and pruning with early stopping is recommended for an imbalanced target, which is why it is so high. How do I fit the final model with the optimal n_estimators after pruning?
Thank you very much for helping me out with this; I am quite frustrated.
import numpy as np
import optuna
from lightgbm import LGBMClassifier, early_stopping
from optuna.integration import LightGBMPruningCallback
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold


def objective(trial, X, y):
    # Search space; n_estimators is fixed at a huge value and cut short by early stopping.
    param_grid = {
        # "device_type": trial.suggest_categorical("device_type", ['gpu']),
        "n_estimators": trial.suggest_categorical("n_estimators", [999999]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 200, 10000, step=100),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float(
            "bagging_fraction", 0.2, 0.95, step=0.1
        ),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
        "feature_fraction": trial.suggest_float(
            "feature_fraction", 0.2, 0.95, step=0.1
        ),
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1121218)
    cv_scores = np.empty(5)

    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        model = LGBMClassifier(
            objective="binary",
            **param_grid,
            n_jobs=-1,
            scale_pos_weight=len(y_train) / y_train.sum(),
        )
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_test, y_test)],
            eval_metric="binary_logloss",  # replace this with e.g. balanced accuracy or f1
            callbacks=[
                LightGBMPruningCallback(trial, "binary_logloss"),  # replace this with e.g. balanced accuracy or f1
                early_stopping(100, verbose=False),
            ],
        )
        preds = model.predict(X_test)  # .argmax(axis=1)
        cv_scores[idx] = balanced_accuracy_score(y_test, preds)

    # Optuna minimizes, so return 1 - median balanced accuracy across folds.
    loss = 1 - np.nanmedian(cv_scores)
    return loss
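For the second question, here is a minimal sketch of a custom metric. It assumes the LightGBM scikit-learn API accepts a callable eval_metric of the form f(y_true, y_pred) that returns (name, value, is_higher_better), and that with the built-in binary objective y_pred holds positive-class probabilities. The name balanced_error and the 0.5 threshold are illustrative choices, not part of the code above. It returns 1 - balanced accuracy so that lower is better, which matches direction="minimize".

```python
import numpy as np


def balanced_error(y_true, y_pred):
    """Custom LightGBM eval metric: 1 - balanced accuracy (lower is better).

    Assumes a binary 0/1 target with both classes present, and that y_pred
    holds positive-class probabilities, thresholded at 0.5 (illustrative).
    """
    y_true = np.asarray(y_true)
    labels = (np.asarray(y_pred) > 0.5).astype(int)
    # Recall of each class; balanced accuracy is their mean.
    recall_pos = np.mean(labels[y_true == 1] == 1)
    recall_neg = np.mean(labels[y_true == 0] == 0)
    balanced_accuracy = (recall_pos + recall_neg) / 2
    # Return (name, value, is_higher_better), the tuple LightGBM expects.
    return "balanced_error", 1.0 - balanced_accuracy, False


# Small hand-made check, independent of LightGBM.
name, value, higher_is_better = balanced_error([0, 0, 0, 1], [0.1, 0.2, 0.7, 0.9])
print(name, round(value, 4), higher_is_better)
```

It could then be passed as eval_metric=balanced_error, with LightGBMPruningCallback(trial, "balanced_error") watching the same name. Setting metric="None" on the classifier should keep early stopping from also tracking the default log loss, though that is worth checking against the installed LightGBM version.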
Run:
study = optuna.create_study(direction="minimize", study_name="LGBM Classifier")
func = lambda trial: objective(trial, X_train, y_train)
study.optimize(func, n_trials=1)
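On the first question, my understanding is that the two callbacks do different jobs. early_stopping ends the boosting of one model once its validation metric has not improved for a set number of rounds. LightGBMPruningCallback reports the metric to Optuna so that the study's pruner can abandon the whole trial (Optuna falls back to a median-based pruner when none is passed to create_study). The pure-Python toy below only illustrates the two rules; the function names are hypothetical and the median rule is a simplification of what Optuna actually does.

```python
from statistics import median


def early_stopping_round(scores, patience):
    """Return the index of the best round, stopping the scan once `patience`
    rounds have passed without improvement (lower score is better)."""
    best_idx = 0
    for i, score in enumerate(scores):
        if score < scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # this one model stops boosting; the trial itself continues
    return best_idx


def should_prune(step, value, earlier_trials):
    """Simplified median rule: prune the trial if its value at `step` is worse
    than the median of earlier trials at the same step."""
    peers = [t[step] for t in earlier_trials if len(t) > step]
    return bool(peers) and value > median(peers)


# Validation loss per boosting round for one model.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.5]
print(early_stopping_round(losses, patience=3))

# Loss curves from earlier trials, compared at step 1.
history = [[0.9, 0.7], [0.9, 0.6], [0.95, 0.75]]
print(should_prune(1, 0.8, history), should_prune(1, 0.65, history))
```

Early stopping decides how many trees a single model keeps, while pruning decides whether a hyperparameter combination is worth finishing at all.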
Fit the final model. Here I don't want to fit with n_estimators=999999, but with the optimal n_estimators:
model = LGBMClassifier(
    objective="binary",
    **study.best_params,
    n_jobs=-1,
    scale_pos_weight=len(y) / y.sum(),
)
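One possible approach to the third question, sketched under assumptions: record model.best_iteration_ (which LightGBM sets when early stopping is used) for each fold inside objective, store the list on the trial with trial.set_user_attr, and override n_estimators before the final fit. The helper name final_model_params and the attribute key "best_iterations" are made up for illustration, and taking the median across folds is one choice among several.

```python
from statistics import median


def final_model_params(best_params, fold_best_iterations):
    """Copy best_params and replace n_estimators with the median
    early-stopped round across CV folds (at least 1)."""
    params = dict(best_params)  # leave the study's own dict untouched
    params["n_estimators"] = max(1, int(round(median(fold_best_iterations))))
    return params


# Hypothetical wiring with the code above (not executed here):
#   inside objective(), after each model.fit(...):
#       best_iters.append(model.best_iteration_)
#   before `return loss`:
#       trial.set_user_attr("best_iterations", best_iters)
#   after study.optimize(...):
#       iters = study.best_trial.user_attrs["best_iterations"]
#       params = final_model_params(study.best_params, iters)
#       model = LGBMClassifier(objective="binary", **params, n_jobs=-1, ...)

# Stand-in values to show the override on its own.
example = final_model_params(
    {"n_estimators": 999999, "learning_rate": 0.05},
    [120, 135, 150, 110, 140],
)
print(example)
```

Because the final model is fitted without a validation set, the fixed n_estimators takes over the role that early stopping played during the search.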