
Putting together different basic examples and documentation examples, I have managed to come up with this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from ray.tune import run
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

def objective(config, reporter):
  for i in range(config['iterations']):
    model = RandomForestClassifier(random_state=0, n_jobs=-1, max_depth=None,
                                   n_estimators=int(config['n_estimators']),
                                   min_samples_split=int(config['min_samples_split']),
                                   min_samples_leaf=int(config['min_samples_leaf']))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Feed the score back to tune?
    reporter(precision=precision_score(y_test, y_pred, average='macro'))

space = {'n_estimators': (100, 200),
        'min_samples_split': (2, 10),
        'min_samples_leaf': (1, 5)}

algo = BayesOptSearch(
    space,
    metric="precision",
    mode="max",
    utility_kwargs={
        "kind": "ucb",
        "kappa": 2.5,
        "xi": 0.0
    },
    verbose=3
    )

scheduler = AsyncHyperBandScheduler(metric="precision", mode="max")
config = {
    "num_samples": 1000,
    "config": {
        "iterations": 10,
    }
}
results = run(objective,
    name="my_exp",
    search_alg=algo,
    scheduler=scheduler,
    stop={"training_iteration": 400, "precision": 0.80},
    resources_per_trial={"cpu": 2, "gpu": 0.5},
    **config)

print(results.dataframe())
print("Best config: ", results.get_best_config(metric="precision"))

It runs, and I am able to get a best configuration at the end. However, my doubt mainly lies in the objective function: do I have it written properly? There are no examples that I could find.

Follow-up question:

  1. What is `num_samples` in the config object? Is it the number of samples it will extract from the overall training data for each trial?

1 Answer


Tune now has native sklearn bindings: https://github.com/ray-project/tune-sklearn

Can you give that a shot instead?
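
For instance, something along these lines (a minimal sketch based on the random_forest.py example in that repo; exact constructor arguments such as `n_trials` may differ between tune-sklearn versions):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from tune_sklearn import TuneSearchCV

# Sample integer hyperparameters with scipy's randint and let
# TuneSearchCV handle the trial loop and cross-validation.
param_distributions = {
    "n_estimators": randint(100, 200),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
}

tune_search = TuneSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_distributions,
    n_trials=20,  # number of hyperparameter configurations to try
    scoring="precision_macro",
)
tune_search.fit(X_train, y_train)
print(tune_search.best_params_)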


To answer your original question: the objective function looks good, and `num_samples` is the total number of hyperparameter configurations you want to try.
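
In other words (an illustrative sketch only, reusing the names from your question), `num_samples` controls how many trials Tune launches, each with one configuration proposed by the search algorithm:

# num_samples=1000 asks Tune to run objective() 1000 times,
# each trial receiving a fresh config suggested by BayesOptSearch
results = run(objective, search_alg=algo, num_samples=1000)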

Also, you'll want to remove the for loop from your training function:

def objective(config, reporter):
    model = RandomForestClassifier(random_state=0, n_jobs=-1, max_depth=None,
                                   n_estimators=int(config['n_estimators']),
                                   min_samples_split=int(config['min_samples_split']),
                                   min_samples_leaf=int(config['min_samples_leaf']))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Feed the score back to tune
    reporter(precision=precision_score(y_test, y_pred, average='macro'))
  • I have been trying that package also. I haven't been able to figure out how to cast to integers for certain parameters like `n_estimators`. The Bayesian space requires tuples using `TuneSearchCV`. The example located here: https://github.com/ray-project/tune-sklearn/blob/master/examples/random_forest.py uses Randomized Search. – LeggoMaEggo Jul 12 '20 at 04:24
  • Got it; updated answer above and will also comment again once we get TuneSearchCV to support multiple types. – richliaw Jul 12 '20 at 04:51
  • Thank you very much. :) Your response puts my mind to ease. I'll be looking forward to the work on `Tune-Sklearn` – LeggoMaEggo Jul 12 '20 at 05:10
  • (ah by the way, just took a closer look at your function, you'll want to remove the forloop in your objective function) – richliaw Jul 12 '20 at 05:23
  • Oh I see! With that being done, can I also remove the `config` key and value in the `config` object? The reason why you're telling me to remove the for loop is because of redundancy? The `num_samples` in the config object is already running the trials multiple times, correct? – LeggoMaEggo Jul 12 '20 at 05:38