
I wrote a program that contains an algorithm called distributed randomized gradient descent (DRGD). The algorithm keeps some internal variables that are used to calculate the step lengths. MXNet's training algorithms should be more complex than DRGD, so they should have even more internal variables. If we can preserve these variables, we can pause training to test the model and then resume training afterwards.

Blue Bird

2 Answers


If you want to store some data across multiple devices (GPUs or machines), you can use KVStore. The MXNet documentation includes a tutorial on how to use it.

Please note, that KVStore is considered to be quite an advanced feature, and should be used with care.
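As a rough illustration only (the integer key and the shapes below are made up for the example), a minimal KVStore round trip looks roughly like this:

import mxnet as mx

# create a single-machine key-value store
kv = mx.kv.create('local')

# initialize a value under a key, then push an update and pull the result back
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))
kv.push(3, mx.nd.ones(shape) * 2)
out = mx.nd.zeros(shape)
kv.pull(3, out=out)
print(out.asnumpy())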

I am not sure, but what you call a "Trainer" in the MXNet world may actually be called an "Optimizer", so please consider reading the Optimizer API documentation as well.
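To make that relationship concrete, here is a small sketch (the `net` variable is a placeholder): a `gluon.Trainer` can be built either from an optimizer name or from an `mx.optimizer.Optimizer` instance, and the optimizer is where step-size state such as the learning rate and Adam's moment estimates lives.

from mxnet import gluon, optimizer

# these two constructions are equivalent; the Trainer wraps an Optimizer
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.001})
trainer = gluon.Trainer(net.collect_params(), optimizer.Adam(learning_rate=0.001))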

Sergei

It is possible to save the state of the trainer and resume training later by calling the .save_states() and .load_states() methods of the Trainer class when training with MXNet Gluon.

Here is an example:

from mxnet import gluon

trainer = gluon.Trainer(net.collect_params(), 'adam')
trainer.save_states('training.states')   # write the optimizer state to disk
trainer.load_states('training.states')   # restore it later to resume training
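For a full pause/resume cycle, you typically want to persist the model parameters as well as the trainer state. A rough sketch, assuming `net` is a Gluon network, a recent MXNet version, and placeholder file names:

from mxnet import gluon

# pause: save weights and optimizer state
net.save_parameters('net.params')
trainer.save_states('training.states')

# resume: rebuild the network and trainer, then restore both
net.load_parameters('net.params')
trainer = gluon.Trainer(net.collect_params(), 'adam')
trainer.load_states('training.states')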
Thomas