
I'm using pytorch_lightning and wandb to run some experiments. The problem is that training silently crashes before finishing, in the following way:

Epoch 997/1000
0.087
Epoch 998/1000
0.080
wandb: Waiting for W&B process to finish... (success).
Epoch 999/1000
0.108

This is what the code looks like:

        wandb_logger.watch(embnet, 'all', log_freq=100)
        
        # Prepare the data
        data.prepare_data()
        
        trainer_embnet = pl.Trainer(logger=wandb_logger,
                                    callbacks=[EmbNetCallback()],
                                    reload_dataloaders_every_n_epochs=1,
                                    max_epochs=cfg_emb.trainer.max_epochs)
        
        trainer_embnet.fit(embnet, datamodule=data)
        
        wandb_logger.experiment.finish()

I have several experiments to run sequentially, and I call `finish()` at the end of each one. Also, on the W&B dashboard I notice that `crashed` appears next to the experiment name.

EDIT:

I think I have solved the issue by adding

wandb_logger.experiment.finalize('success')

before

wandb_logger.experiment.finish()
James Arten

1 Answer


Engineer from W&B here! Since you get the message `wandb: Waiting for W&B process to finish... (success).` before the actual 1000th epoch, some error must be happening there. Are you training on multiple GPUs? Could you also share the console log, in case there is anything relevant in it?

Manan Goel
  • Thanks for reaching out! There's no particular message shown in the console. Also, I'm simply training on CPU for now. All I can see is the `crashed` message under the `State` tab on the W&B platform. – James Arten Nov 09 '22 at 12:10