
This is my first time trying to use Ray Tune for hyperparameter optimization. I am confused about where in the Ray code I should initialize the dataset, and where to put the for-loops that step through the epochs and enumerate the dataset batches.

Background

In my normal training script, I follow several steps:
1. Parse the model options,
2. Initialize the dataset,
3. Create and initialize the model,
4. For-loop for progressing through the epochs,
5. Nested for-loop for enumerating the dataset batches

The Ray Tune documentation says that when defining the Trainable class object, I really only need _setup, _train, _save, and _restore. As I understand it, _train() performs a single iteration and increments training_iteration automatically. Because dataset_size may not be cleanly divisible by batchSize, I calculate total_steps as training progresses. If I understand it right, my total_steps will not be equal to training_iteration. This matters because the number of steps is supposed to determine when the worker is evaluated.
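
To make the question concrete, this is roughly how I imagine my five steps mapping onto those four methods. The class name and method bodies below are only a sketch of my current understanding, not working code:

from ray import tune

class GANTrainable(tune.Trainable):
    def _setup(self, config):
        # steps 1-3: parse the model options, initialize the dataset,
        # create and initialize the model
        pass

    def _train(self):
        # one logical training iteration; must return a dict of metrics --
        # but should this contain my epoch loop, my batch loop, or both?
        return {}

    def _save(self, checkpoint_dir):
        # persist the model weights under checkpoint_dir
        pass

    def _restore(self, checkpoint_path):
        # reload the weights written by _save
        pass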

I also do not want to instantiate the dataset for each worker individually. Ray should instantiate the dataset once, and then the workers can access the data via shared memory.
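
What I have in mind is something along these lines; ray.put/ray.get are the parts I know exist, and whether this is the right way to share the data with Tune workers is exactly what I am unsure about:

import ray

ray.init()

# load the data once in the driver process and put it into the object store (Plasma)
data_loader = CreateDataLoader(TrainOptions().parse())
dataset_id = ray.put(data_loader.load_data())

# hand only the object ID to each trial via the config dict;
# inside _setup every worker would read the same copy back with
#   self.dataset = ray.get(config["dataset_id"])
config = {"dataset_id": dataset_id}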

Original train.py code

# TrainOptions, CreateDataLoader, and self.model are defined elsewhere in my project.
self.opt = TrainOptions().parse()
data_loader = CreateDataLoader(self.opt)
self.dataset = data_loader.load_data()
self.dataset_size = len(data_loader)

total_steps = 0
counter = 0
for epoch in range(self.opt.starting_epoch, self.opt.niter + self.opt.niter_decay + 1):
    for i, data in enumerate(self.dataset):
        # advance by one batch; on the last (possibly partial) batch, clamp
        # total_steps to the exact number of samples seen so far
        total_steps += self.opt.batchSize if i < len(self.dataset) else (self.dataset_size * (epoch + 1)) - total_steps
        counter += 1
        self.model.set_input(data, self.opt)
        self.model.optimizeD()
        # update the generator once every critic_iters discriminator updates
        if counter % self.opt.critic_iters == 0:
            self.model.optimizeG()
LaMaster90
  • Can you provide more context on not wanting to "instantiate the dataset for each worker individually"? Have you tried putting it into Plasma? – richliaw Oct 12 '19 at 09:06

1 Answer


The training_iteration is just a logical unit of training. It would not be a problem to use one epoch per "training_iteration".
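
As a sketch of what that could look like, here is a _train() that does one full pass over the data per call; the optimizeD/optimizeG calls and the total_steps bookkeeping are lifted from your snippet, everything else is illustrative:

def _train(self):
    # one epoch per training_iteration
    for i, data in enumerate(self.dataset):
        self.total_steps += self.opt.batchSize
        self.counter += 1
        self.model.set_input(data, self.opt)
        self.model.optimizeD()
        if self.counter % self.opt.critic_iters == 0:
            self.model.optimizeG()
    # training_iteration is incremented for you; total_steps is reported
    # as just another metric and can be used for stopping/evaluation
    return {"total_steps": self.total_steps}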

richliaw