I'm implementing a LightGBM classifier (LGBMClassifier) whose hyperparameters are chosen via RandomizedSearchCV cross-validation (from the sklearn library).
I have used some arbitrary values for both param_distributions and fit_params, but how should I choose them?
In my case I'm working with genetic data: the dataset has 2,504 rows and 220,001 columns. Is there any algorithm or calculation I can use to choose a sensible range for each tunable parameter?
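For example, I wondered whether some ranges could be derived directly from the data shape, along these lines (the heuristics and names below are my own guesses, not from any reference):

from scipy.stats import randint as sp_randint, uniform as sp_uniform

n_rows, n_cols = 2504, 220001  # my dataset's shape

# Hypothetical heuristics tied to the shape:
# - min_child_samples as a fraction of the row count, so a leaf can't
#   isolate a handful of samples among 220k features
# - colsample_bytree small, since with far more columns than rows most
#   columns are likely noise
param_guess = {'min_child_samples': sp_randint(max(20, n_rows // 100), n_rows // 5),
               'colsample_bytree': sp_uniform(loc=0.01, scale=0.09)}  # ~1%-10% of columns

But I don't know whether scaling ranges like this is actually justified.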
Here's a code snippet I borrowed from this Kaggle kernel:
fit_params = {"early_stopping_rounds" : 50, # TODO: Isn't it too low for GWAS?
"eval_metric" : 'binary',
"eval_set" : [(X_test,y_test)],
'eval_names': ['valid'],
'verbose': 0,
'categorical_feature': 'auto'}
param_test = {'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4],
              'n_estimators': [100, 200, 300, 400, 500, 600, 800, 1000, 1500, 2000, 3000, 5000],
              'num_leaves': sp_randint(6, 50),
              'min_child_samples': sp_randint(100, 500),
              'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
              'subsample': sp_uniform(loc=0.2, scale=0.8),
              'max_depth': [-1, 1, 2, 3, 4, 5, 6, 7],
              'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
              'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
              'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}
# number of sampled parameter combinations
n_iter = 200  # 200 combinations took ~90 minutes in the kernel; use e.g. 2 for a quick test

# initialize the classifier and launch the search;
# metric='None' so only the eval_metric from fit_params is used
lgbm_clf = lgbm.LGBMClassifier(random_state=random_state, silent=True, metric='None', n_jobs=4)
grid_search = RandomizedSearchCV(
    estimator=lgbm_clf,
    param_distributions=param_test,
    n_iter=n_iter,
    scoring='accuracy',
    cv=5,
    refit=True,
    random_state=random_state,
    verbose=True)
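I then launch the search like this (a minimal sketch; X_train and y_train are my own training split, and unpacking fit_params into fit follows the kernel's pattern):

grid_search.fit(X_train, y_train, **fit_params)
print('Best score: {:.4f}'.format(grid_search.best_score_))
print('Best params:', grid_search.best_params_)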
To make the question more focused: how do I choose, for example, how many rounds I need for early_stopping_rounds, and what value to use for n_iter?
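For early_stopping_rounds, the only idea I have so far is a rough probe (my own, not from the kernel): fit once with a large n_estimators and a generous patience, then look at where training actually stops:

probe = lgbm.LGBMClassifier(n_estimators=5000, random_state=random_state)
probe.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          eval_metric='binary',
          early_stopping_rounds=200,  # generous patience, just for the probe
          verbose=0)
print(probe.best_iteration_)  # round where the validation metric stopped improving

But I have no idea whether that is a sound way to pick the value.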