
I'm implementing a LightGBM classifier (LGBMClassifier) whose hyperparameters are chosen by RandomizedSearchCV cross-validation (from the sklearn library).

I have used some arbitrary values for both param_distributions and fit_params, but how should I choose them?

In my case, I'm working with genetic data, and I have a dataset of 2,504 rows and 220,001 columns. Is there any algorithm or calculation I can use to choose a sensible range for each testable parameter?

Here's a code snippet I borrowed from this Kaggle kernel:

import lightgbm as lgbm
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import RandomizedSearchCV

random_state = 42  # any fixed seed, for reproducibility

fit_params = {'early_stopping_rounds': 50,  # TODO: Isn't it too low for GWAS?
              'eval_metric': 'binary',
              'eval_set': [(X_test, y_test)],
              'eval_names': ['valid'],
              'verbose': 0,
              'categorical_feature': 'auto'}

param_test = {'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4],
              'n_estimators': [100, 200, 300, 400, 500, 600, 800, 1000, 1500, 2000, 3000, 5000],
              'num_leaves': sp_randint(6, 50),
              'min_child_samples': sp_randint(100, 500),
              'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
              'subsample': sp_uniform(loc=0.2, scale=0.8),
              'max_depth': [-1, 1, 2, 3, 4, 5, 6, 7],
              'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
              'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
              'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}

# number of parameter combinations to sample
n_iter = 200  # with 200 it takes ~90 minutes; use a small value (e.g. 2) for a quick test

# initialize the LGBM classifier and launch the search;
# metric='None' disables the built-in metric so eval_metric above is used
lgbm_clf = lgbm.LGBMClassifier(random_state=random_state, silent=True, metric='None', n_jobs=4)
grid_search = RandomizedSearchCV(
    estimator=lgbm_clf, param_distributions=param_test,
    n_iter=n_iter,
    scoring='accuracy',
    cv=5,
    refit=True,
    random_state=random_state,
    verbose=True)
grid_search.fit(X_train, y_train, **fit_params)  # fit_params are forwarded to LGBMClassifier.fit

To make the question more focused: how do I choose, for example, the number of rounds for early_stopping_rounds, and the value of n_iter?

Bruno Ambrozio

2 Answers


RandomizedSearchCV samples candidates from the array (or distribution) you supply for each parameter and returns the best value it finds. For example, it will return 0.4 from 'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4] if the last element of the learning_rate array is the best fit. n_iter, however, is an argument of RandomizedSearchCV itself rather than a hyperparameter of the estimator, so it can't be chosen by passing an array; you have to do that search yourself.
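
A minimal sketch of that manual search over n_iter, reusing lgbm_clf, param_test, fit_params and random_state from the question (X_train and y_train are assumed to exist):

best_score, best_n_iter = -float('inf'), None
for candidate in [50, 100, 200, 400]:  # candidate n_iter values (illustrative)
    search = RandomizedSearchCV(
        estimator=lgbm_clf, param_distributions=param_test,
        n_iter=candidate, scoring='accuracy', cv=5,
        refit=True, random_state=random_state)
    search.fit(X_train, y_train, **fit_params)
    if search.best_score_ > best_score:  # keep the best-scoring run
        best_score, best_n_iter = search.best_score_, candidate
print(best_n_iter, best_score)

Roughly speaking, a larger n_iter only samples more candidate combinations, so pick the smallest value beyond which the best score stops improving, trading search quality against runtime.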

Abdirahman
  • What about `early_stopping_rounds`? Is it the same idea as with `n_iter`? Do I have to use arbitrary numbers and see which one is better? I was wondering if there's a way to calculate it. Could you please elaborate on "do the grid_search yourself"? Thanks for the help! – Bruno Ambrozio Mar 21 '20 at 09:20
  • By the way, I understand what the values in the arrays do; my question at this point is how to choose which values should be in the arrays to be tested. – Bruno Ambrozio Mar 21 '20 at 09:23
  • Start with small values and increment, then try another array starting with the end value of the first array, and so on until you find the best value to choose. And yes, the same applies to early stopping. – Abdirahman Mar 22 '20 at 11:16
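
To illustrate that incremental strategy, here is a rough sketch of widening one parameter's range until the best value stops landing on the boundary; evaluate() is a hypothetical helper that would run one cross-validated fit per candidate (e.g. with early_stopping_rounds set to that value) and return the validation score:

import numpy as np

values = [10, 25, 50]                        # initial candidate values (illustrative)
for _ in range(5):                           # cap the number of expansions
    scores = [evaluate(v) for v in values]   # evaluate() is hypothetical
    best = values[int(np.argmax(scores))]
    if best != values[-1]:                   # best value is interior: stop widening
        break
    values = [best, best * 2, best * 4]      # new array starts at the old end value
print('chosen value:', best)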

"I have used some arbitrary values for either the param_distributions and fit_params, but how should I choose them?". My advice is to take values around the default defined by sklearn. Actually, depending on the problem and the algorithm you use you can try some guided guess. For example, there is some research work that states that generally Random Forest provides better results when 100 <= n_estimators <= 500. You can start in such a way, but if you really need to find the (sub)optimal parameters you can use optimization algorithms, such as Genetic Algorithms, that start from random values and try to converge towards the optimal values.

s.dallapalma