
Most models have a steps parameter indicating the number of steps to run over the data. Yet in practice I see that we also commonly execute the fit function for N epochs.

What is the difference between running 1000 steps with 1 epoch and running 100 steps with 10 epochs? Which one is better in practice? Does any logic change between consecutive epochs, e.g. data shuffling?

nbro
Yang
    **Jason Brownlee** at machinelearningmastery.com has a very nice, [detailed answer](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/) to exactly that question. – BmyGuest Apr 16 '19 at 20:12

9 Answers


A training step is one gradient update. In one step batch_size examples are processed.

An epoch consists of one full cycle through the training data. This is usually many steps. As an example, if you have 2,000 images and use a batch size of 10, an epoch consists of:

2,000 images / (10 images / step) = 200 steps.

If you choose your training image randomly (and independently) in each step, you normally do not call it epoch. [This is where my answer differs from the previous one. Also see my comment.]
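The arithmetic above, as a quick Python sketch (the numbers are the example's; nothing here is framework-specific):

```python
num_images = 2000   # training examples, from the example above
batch_size = 10     # examples processed per step (one gradient update)

# One epoch is one full pass over the data, so:
steps_per_epoch = num_images // batch_size
print(steps_per_epoch)  # 200
```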

vvvvv
MarvMind

An epoch usually means one iteration over all of the training data. For instance, if you have 20,000 images and a batch size of 100, then an epoch should contain 20,000 / 100 = 200 steps. However, I usually just set a fixed number of steps, like 1000 per epoch, even though I have a much larger data set. At the end of the epoch I check the average cost and, if it improved, I save a checkpoint. There is no difference between steps from one epoch to another; I just treat them as checkpoints.

People often shuffle around the data set between epochs. I prefer to use the random.sample function to choose the data to process in my epochs. So say I want to do 1000 steps with a batch size of 32. I will just randomly pick 32,000 samples from the pool of training data.
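A minimal sketch of that sampling scheme, assuming the training pool is just a list of example IDs (random.sample draws without replacement):

```python
import random

training_pool = list(range(100000))  # hypothetical pool of example IDs
steps = 1000
batch_size = 32

# Pick steps * batch_size = 32,000 examples at random, as described above...
chosen = random.sample(training_pool, steps * batch_size)

# ...then slice them into 1000 batches of 32 for the training loop.
batches = [chosen[i:i + batch_size] for i in range(0, len(chosen), batch_size)]
```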

chasep255
    The second part of your answer is wrong, in my opinion. An epoch is defined as one cycle through the training data; it is not an epoch if you fix the number of steps. By the same logic, you can't call it an epoch if you sample the training examples independently in each step. You can save your checkpoint and do checks every N steps, but this does not mean that N steps become an epoch. I would avoid calling this an epoch in the code; it has the potential to confuse. – MarvMind Jun 07 '17 at 14:45

As I am currently experimenting with the tf.estimator API, I would like to add my new findings here, too. I don't know yet whether the usage of the steps and epochs parameters is consistent throughout TensorFlow, so for now I am only referring to tf.estimator (specifically tf.estimator.LinearRegressor).

Training steps defined by num_epochs: steps not explicitly defined

estimator = tf.estimator.LinearRegressor(feature_columns=ft_cols)
train_input =  tf.estimator.inputs.numpy_input_fn({'x':x_train},y_train,batch_size=4,num_epochs=1,shuffle=True)
estimator.train(input_fn=train_input)

Comment: I have set num_epochs=1 for the training input and the doc entry for numpy_input_fn tells me "num_epochs: Integer, number of epochs to iterate over data. If None will run forever.". With num_epochs=1 in the above example the training runs exactly x_train.size/batch_size times/steps (in my case this was 175000 steps as x_train had a size of 700000 and batch_size was 4).

Training steps defined by num_epochs: steps explicitly defined higher than number of steps implicitly defined by num_epochs=1

estimator = tf.estimator.LinearRegressor(feature_columns=ft_cols)
train_input =  tf.estimator.inputs.numpy_input_fn({'x':x_train},y_train,batch_size=4,num_epochs=1,shuffle=True)
estimator.train(input_fn=train_input, steps=200000)

Comment: num_epochs=1 in my case would mean 175,000 steps (x_train.size / batch_size with x_train.size=700,000 and batch_size=4), and this is exactly the number of steps estimator.train runs, even though the steps parameter was set to 200,000 in estimator.train(input_fn=train_input, steps=200000).

Training steps defined by steps

estimator = tf.estimator.LinearRegressor(feature_columns=ft_cols)
train_input =  tf.estimator.inputs.numpy_input_fn({'x':x_train},y_train,batch_size=4,num_epochs=1,shuffle=True)
estimator.train(input_fn=train_input, steps=1000)

Comment: Although I have set num_epochs=1 when calling numpy_input_fn, the training stops after 1000 steps. This is because steps=1000 in estimator.train(input_fn=train_input, steps=1000) overrides the num_epochs=1 in tf.estimator.inputs.numpy_input_fn({'x':x_train},y_train,batch_size=4,num_epochs=1,shuffle=True).

Conclusion: Whatever the num_epochs parameter of tf.estimator.inputs.numpy_input_fn and the steps parameter of estimator.train define, the lower bound determines the number of steps that will actually be run.
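That conclusion can be written down as a small helper (a sketch only; effective_steps is a made-up name, not part of the tf.estimator API):

```python
def effective_steps(num_examples, batch_size, num_epochs=None, steps=None):
    """The lower of the epoch-implied step count and the explicit steps wins."""
    epoch_steps = None
    if num_epochs is not None:
        epoch_steps = (num_examples * num_epochs) // batch_size
    bounds = [b for b in (epoch_steps, steps) if b is not None]
    return min(bounds) if bounds else None  # None: run forever

# The three experiments above (x_train.size=700,000, batch_size=4):
assert effective_steps(700000, 4, num_epochs=1) == 175000
assert effective_steps(700000, 4, num_epochs=1, steps=200000) == 175000
assert effective_steps(700000, 4, num_epochs=1, steps=1000) == 1000
```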

A_Matar
dmainz

In easy words
Epoch: one full pass over the entire dataset.
Steps: in TensorFlow, the number of steps is the number of epochs multiplied by the number of examples, divided by the batch size:

steps = (epochs * examples) / batch_size

For instance, with epochs = 100, examples = 1000 and batch_size = 1000:
steps = 100
  • Umar, I get a better result using your formula but just wondering why everyone has a different formula? Like everyone else above says, steps = (total number of images)/batch size. – Satyendra Sahani Jul 30 '19 at 12:29
  • @SatyendraSahani I got this formula from one of the instructor of GCP course offered at coursera, may be this is the case that you got better result. – Muhammad Umar Amanat Aug 01 '19 at 07:11
  • 1
    @Umar, but at some times the number of samples is huge. Like in our case we are having 99,000 samples. If we choose a batch size 8 and epochs 20. the number of total step_size is (20*99000)/8 = 247,500. Which is really a high number. there I start doubting this method. – Satyendra Sahani Aug 06 '19 at 10:20

Epoch: A training epoch represents a complete use of all the training data for gradient calculation and optimization (training the model).

Step: A training step means using one batch size of training data to train the model.

Number of training steps per epoch: total_number_of_training_examples / batch_size.

Total number of training steps: number_of_epochs x Number of training steps per epoch.
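A minimal sketch of the two formulas above, with made-up numbers:

```python
# Hypothetical numbers; the formulas are the ones stated above.
total_number_of_training_examples = 50000
batch_size = 100
number_of_epochs = 10

steps_per_epoch = total_number_of_training_examples // batch_size
total_steps = number_of_epochs * steps_per_epoch

print(steps_per_epoch, total_steps)  # 500 5000
```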

Xin
  • Just to add on to this, if there is a validation set of size `V`, then the number of training steps per epoch is `(total_number_of_training_examples - V)`/`batch_size` – Noam Suissa Aug 19 '21 at 16:54

According to Google's Machine Learning Glossary, an epoch is defined as

"A full training pass over the entire dataset such that each example has been seen once. Thus, an epoch represents N/batch_size training iterations, where N is the total number of examples."

If you are training a model for 10 epochs with batch size 6, given a total of 12 samples, that means:

  1. the model will be able to see the whole dataset in 2 iterations (12 / 6 = 2), i.e. a single epoch.

  2. overall, the model will have 2 X 10 = 20 iterations (iterations-per-epoch X no-of-epochs)

  3. re-evaluation of loss and model parameters will be performed after each iteration!
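The points above can be sketched as a plain training loop (pure Python; train_on_batch is a stand-in for a framework's gradient-update call, not a real API here):

```python
def run_training(data, batch_size, num_epochs, train_on_batch):
    """Nest steps inside epochs; each step trains on one batch."""
    steps_per_epoch = len(data) // batch_size
    total_steps = 0
    for epoch in range(num_epochs):
        for step in range(steps_per_epoch):
            batch = data[step * batch_size:(step + 1) * batch_size]
            train_on_batch(batch)  # one iteration: loss and parameters updated
            total_steps += 1
    return total_steps

# 12 samples, batch size 6, 10 epochs -> 2 iterations/epoch, 20 in total.
assert run_training(list(range(12)), 6, 10, lambda b: None) == 20
```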

Divi

Since there's no accepted answer yet: by default, an epoch runs over all of your training data. In this case you have n steps, with n = training_length / batch_size.

If your training data is too big, you can decide to limit the number of steps during an epoch. [https://www.tensorflow.org/tutorials/structured_data/time_series?_sm_byp=iVVF1rD6n2Q68VSN]

When the number of steps reaches the limit that you’ve set the process will start over, beginning the next epoch. When working in TF, your data is usually transformed first into a list of batches that will be fed to the model for training. At each step you process one batch.
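The "start over, beginning the next epoch" behaviour can be sketched in plain Python (itertools.cycle stands in for the batch stream; this is an illustration of the described behaviour, not TensorFlow's actual implementation):

```python
import itertools

def train_steps_limited(batches, steps_per_epoch, num_epochs, train_on_batch):
    """When an epoch is capped at steps_per_epoch, the batch stream simply
    continues where it left off in the next epoch."""
    stream = itertools.cycle(batches)  # start over once the data runs out
    for _ in range(num_epochs):
        for _ in range(steps_per_epoch):
            train_on_batch(next(stream))

# Three batches, epochs capped at 2 steps: epoch 2 resumes at batch [3].
seen = []
train_steps_limited([[1], [2], [3]], steps_per_epoch=2, num_epochs=2,
                    train_on_batch=seen.append)
assert seen == [[1], [2], [3], [1]]
```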

As to whether it's better to set 1000 steps for 1 epoch or 100 steps with 10 epochs, I don't know if there's a straight answer. But here are the results of training a CNN with both approaches on data from the TensorFlow time series tutorial:

In this case, both approaches lead to very similar predictions; only the training profiles differ.

steps = 20 / epochs = 100: [training-profile and prediction plots]

steps = 200 / epochs = 10: [training-profile and prediction plots]

Yoan B. M.Sc

Divide the length of x_train by the batch size with

steps_per_epoch = x_train.shape[0] // batch_size
kiriloff

We split the training set into many batches. When we run the algorithm, it requires one epoch to analyze the full training set. An epoch is composed of many iterations (or batches).

Iterations: the number of batches needed to complete one Epoch.

Batch Size: The number of training samples used in one iteration.

Epoch: one full cycle through the training dataset. A cycle is composed of many iterations.

Number of Steps per Epoch = (Total Number of Training Samples) / (Batch Size)

Example: Training Set = 2,000 images, Batch Size = 10

Number of Steps per Epoch = 2,000 / 10 = 200 steps

Hope this helps for better understanding.