
I have the following piece of code - it is a train function for logistic regression. I run sweeps to compare hyperparameter tuning results. My issue is that I don't know how to incorporate StratifiedKFold so that it works with sweeps. I would appreciate it if someone could help me modify my code:

def train(
    config=None,
    X_train=features_train,
    y_train=labels_train,
    X_test=features_test,
    y_test=labels_test
):
    with wandb.init(
        project=WANDB_PROJECT_NAME,
        entity="name",
        config=config_defaults,
        tags=['logistic regression', 'tf-idf', 'l2', 'class weight', 'C'],
        notes='Logistic regression run with several regularizations and with either None penalty or l2 penalty, and "balanced" or pre-calculated class_weight.'
    ):
        config = wandb.config

        log_reg = LogisticRegression(
            penalty=config.penalty,
            C=config.C,
            class_weight=config.class_weight
        )

        log_reg.fit(X_train, y_train)

        y_pred = log_reg.predict(X_test)
        y_proba = log_reg.predict_proba(X_test)
        labels = list(map(str, y_labels['label'].unique()))

        # Visualize single plot
        cm = wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels)

        score_f1 = f1_score(y_test, y_pred, average='weighted')

        sm = wandb.sklearn.plot_summary_metrics(
            log_reg, X_train, y_train, X_test, y_test)

        roc = wandb.sklearn.plot_roc(y_test, y_proba)

        wandb.log({
            "f1-weighted-log-regr-1": score_f1,
            "roc-log-regr-1": roc,
            "conf-mat-log-regr-1": cm,
            "summary-metrics-log-regr-1": sm
        })

sweep_id = wandb.sweep(sweep_config, project="log-regr")
wandb.agent(sweep_id, function=train)
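
(The sweep_config passed to wandb.sweep above is not shown in the question; a hypothetical grid configuration that matches the parameters used in train could look roughly like this, with illustrative values only:)

# Hypothetical sweep_config, for illustration; the real values are not given in the question.
sweep_config = {
    'method': 'grid',
    'metric': {'name': 'f1-weighted-log-regr-1', 'goal': 'maximize'},
    'parameters': {
        'penalty': {'values': ['l2', 'none']},  # or None instead of 'none', depending on the sklearn version
        'C': {'values': [0.01, 0.1, 1, 10]},
        'class_weight': {'values': ['balanced', None]}
    }
}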
2 Answers


The following code should solve your issue. If you want multiple metrics, consider the 'cross_validate' function instead of 'cross_val_score'.

import wandb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score


def train(
    config=None,
    X_train=features_train,
    y_train=labels_train,
    X_test=features_test,
    y_test=labels_test
):
    with wandb.init(
        project=WANDB_PROJECT_NAME,
        entity="name",
        config=config_defaults,
        tags=['logistic regression', 'tf-idf', 'l2', 'class weight', 'C'],
        notes='Logistic regression run with several regularizations and with either None penalty or l2 penalty, and "balanced" or pre-calculated class_weight.'
    ):
        config = wandb.config

        log_reg = LogisticRegression(
            penalty=config.penalty,
            C=config.C,
            class_weight=config.class_weight
        )

        # Stratified folds preserve the class distribution in every split
        cv = StratifiedKFold(n_splits=3, shuffle=True)

        scores = cross_val_score(log_reg,
                                 X_train,
                                 y_train,
                                 scoring='recall',  # replace 'recall' with your favorite metric
                                 cv=cv)
        wandb.log({"Mean score": scores.mean()})


sweep_id = wandb.sweep(sweep_config, project="log-regr")
wandb.agent(sweep_id, function=train)
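
If you want to log more than one metric per sweep run, a rough sketch of the cross_validate variant mentioned above could look like this; the two scorer names are only illustrative choices, not part of the original answer:

from sklearn.model_selection import cross_validate

# Sketch: replace the cross_val_score call above with cross_validate to get
# several metrics at once; the scorer names are illustrative.
results = cross_validate(log_reg,
                         X_train,
                         y_train,
                         scoring=['f1_weighted', 'recall_weighted'],
                         cv=cv)
wandb.log({
    "mean f1-weighted": results['test_f1_weighted'].mean(),
    "mean recall-weighted": results['test_recall_weighted'].mean()
})

cross_validate returns a dict whose 'test_<scorer>' arrays hold the per-fold scores, so you can average and log as many metrics as you like.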

From this answer:

We put together an example of how to accomplish k-fold cross validation:

https://github.com/wandb/examples/tree/master/examples/wandb-sweeps/sweeps-cross-validation

The solution requires some contortions for the wandb library to spawn multiple jobs on behalf of a launched sweep job.

The basic idea is:

The agent requests a new set of parameters from the cloud hosted parameter server. This is the run called sweep_run in the main function. Send information about what the folds should process over a multiprocessing queue to waiting processes Each spawned process logs to their own run, organized with group and job_type to enable auto-grouping in the UI When the process is finished, it sends the primary metric over a queue to the parent sweep run The sweep run reads metrics from the child runs and logs it to the sweep run so that the sweep can use that result to impact future parameter choices and/or hyperband early termination optimizations Example visualizations of the sweep and k-fold grouping can be seen here:

Sweep: https://app.wandb.ai/jeffr/examples-sweeps-cross-validation/sweeps/vp0fsvku
K-fold Grouping: https://app.wandb.ai/jeffr/examples-sweeps-cross-validation/groups/vp0fsvku
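
The following is a rough, self-contained sketch of that queue-and-group pattern applied to StratifiedKFold; it is not the code from the linked example. It assumes features_train and labels_train are module-level NumPy arrays (as in the question) and a fork-based multiprocessing start method; the names fold_worker and NUM_FOLDS are made up for illustration, and the linked example adds more careful process handling:

import multiprocessing

import wandb
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

NUM_FOLDS = 3  # illustrative


def fold_worker(train_idx, val_idx, group_id, config, queue):
    # Each fold logs to its own run; group/job_type enable auto-grouping in the UI.
    run = wandb.init(project="log-regr", group=group_id, job_type="fold",
                     config=config, reinit=True)
    log_reg = LogisticRegression(penalty=config["penalty"],
                                 C=config["C"],
                                 class_weight=config["class_weight"])
    log_reg.fit(features_train[train_idx], labels_train[train_idx])
    score = f1_score(labels_train[val_idx],
                     log_reg.predict(features_train[val_idx]),
                     average='weighted')
    run.log({"fold f1-weighted": score})
    run.finish()
    queue.put(score)  # report the fold metric back to the parent sweep run


def train():
    sweep_run = wandb.init()  # the run controlled by the sweep agent
    config = dict(sweep_run.config)

    cv = StratifiedKFold(n_splits=NUM_FOLDS, shuffle=True)
    queue = multiprocessing.Queue()
    workers = []
    for train_idx, val_idx in cv.split(features_train, labels_train):
        p = multiprocessing.Process(target=fold_worker,
                                    args=(train_idx, val_idx, sweep_run.id,
                                          config, queue))
        p.start()
        workers.append(p)

    scores = [queue.get() for _ in workers]
    for p in workers:
        p.join()

    # Log the averaged fold metric so the sweep can optimize on it.
    sweep_run.log({"mean f1-weighted": sum(scores) / len(scores)})

Launching is unchanged: wandb.agent(sweep_id, function=train).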

  • It is difficult for me to understand where stratified K fold will be used. I see some functions, but no example with a dataset. – Yana Aug 26 '22 at 13:01