
I am using Google Vertex AI to train models, and I am not sure what the global_step parameter specifies. In some Vertex AI tutorials it is set to a variable named 'NUM_EPOCHS'. Looking at the GitHub repository for the package doesn't add much clarity.

I'm not sure how this can be referring to the number of epochs that the model is trained with, as I feel that can be done more easily just by writing code (and its default value, 1000, seems absurdly high). What does this parameter mean?

RichMash
  • I believe it is related to this question too: https://stackoverflow.com/questions/41166681/what-does-global-step-mean-in-tensorflow – Nestor Ceniza Jr Aug 10 '22 at 05:59
  • Hi @RichMash, if my answer addressed your question, please consider accepting and upvoting it. If not, let me know so that I can improve my answer. Accepting an answer will help the community members with their research as well. – Shipra Sarkar Aug 11 '22 at 05:12
  • This is a good question. I am not sure if it's related to the other question linked, because that has to do w/ global_step in Tensorflow. This question is for reporting a global step when reporting hyperparameter tuning metrics (of any model/framework) to Vertex AI. My guess is that for its optimization Vertex would like you to report how many steps your local model has made in its training along with the metric name/value you are reporting. Google's documentation could certainly be improved in this regard. – Stephen Aug 30 '22 at 14:32
  • Hi @Stephen, the global_step in reporting hyperparameter tuning metrics to Vertex AI refers to the number of batches seen by the graph. I have updated my answer. – Shipra Sarkar Sep 01 '22 at 07:23
  • Hi @Stephen, I have added additional details in the answer. – Shipra Sarkar Sep 12 '22 at 07:19

1 Answer


The global_step argument passed to the report_hyperparameter_tuning_metric function in the training step is the number of batches the graph has seen, as described in this StackOverflow question. It represents how many batches the model has seen during training, from the start until now.

The function report_hyperparameter_tuning_metric records the value of a metric (e.g. loss) and writes it to a file, so that you can see how well the model is performing. It takes the metric value and the step number (how many steps have passed, i.e. how many batches the model has seen) and records this data point. The function should be called after every step (the model sees a batch, updates the weights and the metric values, and calls this function), so that the training metrics are recorded as a 2D plot (number of steps vs. metric). This step number equals the value of global_step, which keeps track of the number of batches.
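
As a rough illustration of the step counting described above, here is a minimal sketch in plain Python. The names train, record, and recorded are hypothetical stand-ins, and the loss update is a placeholder rather than a real model. In an actual Vertex AI job, the report callable would presumably wrap hypertune.HyperTune().report_hyperparameter_tuning_metric(...), passing the counter as its global_step argument.

```python
from typing import Callable, List, Tuple

# Hypothetical training loop: `report` stands in for the real reporting call.
def train(num_epochs: int,
          batches_per_epoch: int,
          report: Callable[[str, float, int], None]) -> int:
    global_step = 0   # number of batches seen so far
    loss = 1.0        # placeholder metric, not a real model loss
    for _ in range(num_epochs):
        for _ in range(batches_per_epoch):
            loss *= 0.9          # pretend one batch improved the loss
            global_step += 1     # one more batch has been seen
            report("loss", loss, global_step)
    return global_step

# Stand-in recorder (assumption): collects (tag, value, step) data points
# instead of writing them to the metrics file.
recorded: List[Tuple[str, float, int]] = []

def record(tag: str, value: float, step: int) -> None:
    recorded.append((tag, value, step))

final_step = train(num_epochs=2, batches_per_epoch=3, report=record)
```

With 2 epochs of 3 batches each, the counter ends at 6, and each recorded data point pairs a metric value with the step at which it was observed. Whether to report per batch or per epoch is a design choice of the training code; this sketch follows the per-batch reading given in the answer.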

The global_step keeps track of the number of batches seen. It must be an integer variable. Each time a batch is provided, the weights are updated in a direction that minimizes the loss. When a variable is passed as the global_step argument of optimizer.minimize(), it is incremented by one after each update.

Shipra Sarkar