I am looking at code for an RNN language model. I am confused about 1) how the training pairs (x, y) are constructed and, subsequently, 2) how the loss is computed. The code borrows from the TensorFlow RNN tutorial (the reader module).
Within the reader module, a generator, ptb_iterator, is defined. It takes the data in as one sequence and yields (x, y) pairs in accordance with the batch size and the number of steps you wish to 'unroll' the RNN for. It is best to look at the entire definition first, but the part that confused me is this:
```python
for i in range(epoch_size):
    # data has shape [batch_size, batch_len]; x is the i-th
    # non-overlapping chunk of num_steps columns
    x = data[:, i*num_steps:(i+1)*num_steps]
    # y is the same chunk shifted one position to the right
    y = data[:, i*num_steps+1:(i+1)*num_steps+1]
    yield (x, y)
```
which is documented as:
*Yields:
Pairs of the batched data, each a matrix of shape [batch_size, num_steps].
The second element of the tuple is the same data time-shifted to the
right by one.*
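For context, this slicing only makes sense once the raw sequence has been reshaped into a [batch_size, batch_len] matrix and epoch_size has been derived from it. Here is a minimal self-contained sketch of what I understand the whole generator to be doing (my own variable names and reshaping; the tutorial's actual code may differ in detail):

```python
import numpy as np

def ptb_iterator_sketch(raw_data, batch_size, num_steps):
    """Sketch of the reader's generator as I understand it."""
    raw_data = np.array(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # Cut the single sequence into batch_size parallel rows.
    data = np.reshape(raw_data[:batch_size * batch_len],
                      [batch_size, batch_len])
    # Each row is consumed in non-overlapping chunks of num_steps,
    # leaving one position at the end for the shifted targets.
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i*num_steps:(i+1)*num_steps]
        y = data[:, i*num_steps+1:(i+1)*num_steps+1]
        yield (x, y)
```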
So if I understand correctly, for the data sequence [1 2 3 4 5 6] and num_steps = 2, then for stochastic gradient descent (i.e. batch_size = 1) the following pairs will be generated:

- x=[1,2] , y=[2,3]
- x=[3,4] , y=[4,5]
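As a sanity check, running that sequence through the sketch above prints exactly these two pairs:

```python
for x, y in ptb_iterator_sketch([1, 2, 3, 4, 5, 6],
                                batch_size=1, num_steps=2):
    print(x, y)
# [[1 2]] [[2 3]]
# [[3 4]] [[4 5]]
```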
1) Is this the correct way to do this? Should the pairs not instead be constructed in one of the following two ways (both sketched in code after the list):
- x=[1,2] , y=[2,3]
- x=[2,3] , y=[3,4] ... # allows for more data points
OR
- x=[1,2] , y=[3]
- x=[2,3] , y=[4] ... # ensures that all predictions are made with context length = num_steps
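To be concrete, the two alternatives I have in mind would look something like this (again my own sketches, not code from the reader module):

```python
def sliding_pairs(data, num_steps):
    # Alternative 1: overlapping windows with full-length targets,
    # yielding len(data) - num_steps pairs instead of
    # (len(data) - 1) // num_steps.
    for i in range(len(data) - num_steps):
        yield data[i:i+num_steps], data[i+1:i+num_steps+1]

def full_context_pairs(data, num_steps):
    # Alternative 2: overlapping windows with a single next-token
    # target, so every prediction has exactly num_steps of context.
    for i in range(len(data) - num_steps):
        yield data[i:i+num_steps], data[i+num_steps]
```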
2) Lastly, given that the pairs are generated as they are in the reader module, when it comes to training, will the computed loss not reflect the RNN's performance over a range of context lengths (from 1 up to num_steps), rather than over the full num_steps specified?
For example, the model will make a prediction for x=3 (the first element of x=[3,4]) without considering that 2 came before it, i.e. the RNN is unrolled one step instead of two for that prediction.
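To make question 2 concrete: as I understand it, the loss for one (x, y) chunk is the cross-entropy averaged over all num_steps positions, so terms computed from very short contexts are mixed in with terms computed from the full context. A numpy sketch of that averaging (my own illustration, not the tutorial's actual TensorFlow loss code):

```python
import numpy as np

def chunk_loss(probs, y):
    """Average cross-entropy over one chunk.

    probs: [num_steps, vocab_size] softmax outputs, one row per position of x
    y:     [num_steps] integer target ids

    The prediction at position t only saw x[0..t], so the averaged
    terms come from context lengths 1, 2, ..., num_steps.
    """
    num_steps = len(y)
    return -np.mean(np.log(probs[np.arange(num_steps), y]))
```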