
I have been using Ray Tune for a while now and it is really good! But when combining it with MLflow and Keras callbacks, I have run into problems.

My Settings:

  • Windows
  • tensorflow==2.11.0
  • ray==2.3.0
  • mlflow==2.2.1

I am using it with a tune_trainable function and a trainable function (see below), with ReportCheckpointCallback for Keras and MLflowLoggerCallback for automated logging. My code works so far: the trials are created and run, my parameters and metrics are logged to MLflow, and the scheduler stops trials as expected. The config is a YAML file created with Hydra.

But now I have two further requirements that I am not able to solve:

  • I am logging mean_squared_error and mean_absolute_error, and so far so good (the metrics show up in MLflow and the scheduler uses them to stop trials). But I do not really know on which dataset (train or val) the metrics are calculated. I pass both the train and the val set, but only get one value reported per metric (see the sketch below the metrics list for what I assume I would have to pass).
  • I want to log further custom metrics within the trainable with mlflow.log_metric(key, value), for example passing the test set to model.evaluate(ds_test) and storing the result in MLflow after training ends. I put mlflow.log_metric() into the trainable, but then I get an error: mlflow.exceptions.MlflowException: Run '3ef7584943e440a08ee53c7a70a4de53' not found. After that I tried a custom Keras callback to log a metric only for testing, but then a new folder "mlflow" is created under artifacts in MLflow, which contains a new run with this metric (see image).

The metrics I pass to the callbacks:

  • mean_squared_error
  • mean_absolute_error
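
If I understand the Keras logs correctly, the metrics under their plain names are computed on the training data, while the same metrics with a val_ prefix are computed on the validation data, so I assume I would have to list both names explicitly to get both reported to Tune (and from there to MLflow), roughly like this:

ReportCheckpointCallback(
    metrics=[
        "mean_squared_error",       # computed on ds_train
        "val_mean_squared_error",   # computed on ds_val
        "mean_absolute_error",
        "val_mean_absolute_error",
    ]
)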

import mlflow
from tensorflow import keras

class CustomCallback(keras.callbacks.Callback):
    def __init__(self, ds_test):
        super(CustomCallback, self).__init__()
        self.ds_test = ds_test

    def on_train_begin(self, logs=None):
        mlflow.log_metric("00_my_custom", 44)
from ray.air.integrations.keras import ReportCheckpointCallback


def trainable(cfg: dict) -> None:

    data_preparer = data.ingestion.DataPreparer(cfg)
    ds_train, ds_val, ds_test = data_preparer.get_tf_train_val_test_datasets()

    pointnet = model_HybridPointNetMeta.HybridPointNetMeta(cfg)
    model = pointnet.build_model()

    compiler = compile_fit.CompileFitter(cfg)
    model = compiler.compile_fit_model(
        model,
        ds_train,
        ds_val,
        callbacks=[
            ReportCheckpointCallback(metrics=list(cfg.ml_trainer.METRICS)),
            CustomCallback(ds_test),
        ],
    )
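
Concretely, what I want to add at the end of the trainable looks roughly like this (this is the version that raises the "Run not found" error; the exact metric key depends on the metrics the model is compiled with):

# At the end of trainable(), after compile_fit_model(...):
test_results = model.evaluate(ds_test, return_dict=True)  # dict keyed by the compiled metric names
mlflow.log_metric("test_mean_absolute_error", test_results["mean_absolute_error"])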

from omegaconf import DictConfig, OmegaConf
from ray import air, tune
from ray.air.integrations.mlflow import MLflowLoggerCallback


def tune_trainable(cfg: DictConfig) -> None:

    dict_cfg = OmegaConf.to_container(cfg, resolve=True)

    sched = get_asha_scheduler(cfg)
    search_alg = None

    tuner = tune.Tuner(
        tune.with_resources(
            trainable,
            resources={
                "cpu": cfg.ml_tuner.RESSOURCES_PER_ITER.NUM_CPU,
                "gpu": cfg.ml_tuner.RESSOURCES_PER_ITER.NUM_GPU,
            },
        ),
        run_config=air.RunConfig(
            name=cfg.ml_tuner.RUN_CONFIG.NAME,
            stop=None,
            callbacks=[
                MLflowLoggerCallback(
                    tracking_uri="http://127.0.0.1:5000",
                    experiment_name="Test",
                    save_artifact=False,
                ),
            ],
            verbose=cfg.ml_tuner.RUN_CONFIG.VERBOSE,
        ),
        tune_config=tune.TuneConfig(
            search_alg=search_alg,
            scheduler=sched,
            metric=cfg.ml_trainer.METRICS[0],
            mode=cfg.ml_tuner.TUNE_CONFIG.MODE_METRICS,
            num_samples=cfg.ml_tuner.TUNE_CONFIG.NUM_SAMPLES,
        ),
        param_space=dict_cfg,
    )
    results = tuner.fit()
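
For completeness, get_asha_scheduler is only a thin wrapper around the standard ASHA scheduler; a sketch of it (the key names under cfg.ml_tuner.SCHEDULER are just placeholders for my YAML values):

from ray.tune.schedulers import ASHAScheduler

def get_asha_scheduler(cfg: DictConfig) -> ASHAScheduler:
    # Sketch: the actual values come from the Hydra/YAML config.
    return ASHAScheduler(
        time_attr="training_iteration",
        max_t=cfg.ml_tuner.SCHEDULER.MAX_T,
        grace_period=cfg.ml_tuner.SCHEDULER.GRACE_PERIOD,
        reduction_factor=cfg.ml_tuner.SCHEDULER.REDUCTION_FACTOR,
    )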

I also tried setup_mlflow() within the trainable, but then I get an error that the params are not allowed to be overwritten. The last thing I tried is the @mlflow_mixin decorator on the trainable function. This creates the trials in MLflow and logs what I want to log, but then the metrics do not get back to Ray Tune to control the scheduler.
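
For reference, as far as I understand the mixin, it expects an "mlflow" entry in the trial config / param_space; a rough sketch of that setup (with this, mlflow.log_metric works inside the trainable, but the metrics no longer reach the scheduler):

from ray.tune.integration.mlflow import mlflow_mixin

@mlflow_mixin
def trainable(cfg: dict) -> None:
    ...
    mlflow.log_metric("00_my_custom", 44)  # logged to MLflow, but not reported back to Tune

# The mixin reads its MLflow settings from the trial config / param_space:
dict_cfg["mlflow"] = {
    "tracking_uri": "http://127.0.0.1:5000",
    "experiment_name": "Test",
}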

Can anyone help? Thanks! Patrick
