I'm using pytorch_lightning
and wandb
to conduct some experiments. The problem is that training silently crashes before finishing, in the following way:
Epoch 997/1000
0.087
Epoch 998/1000
0.080
wandb: Waiting for W&B process to finish... (success).
Epoch 999/1000
0.108
This is what the code looks like:
wandb_logger.watch(embnet, 'all', log_freq=100)

# Prepare the data
data.prepare_data()

trainer_embnet = pl.Trainer(logger=wandb_logger,
                            callbacks=[EmbNetCallback()],
                            reload_dataloaders_every_n_epochs=1,
                            max_epochs=cfg_emb.trainer.max_epochs)
trainer_embnet.fit(embnet, datamodule=data)
wandb_logger.experiment.finish()
I have several experiments that run sequentially, and I call finish()
at the end of each one. I also notice that on the W&B dashboard, "crashed"
appears next to the experiment name.
EDIT:
I think I have solved the issue by adding
wandb_logger.experiment.finalize('success')
before
wandb_logger.experiment.finish()