I am going through the tutorial of deep Markov models where they are trying to learn the polyphonic dataset. The Link to tutorial is:
https://pyro.ai/examples/dmm.html
This model parameterises transitions and emissions with using a neural network and for variational inference part they use RNN to map observable 'x' to latent space. And in order to ensure that their model is learning something they try to maximise ELBO or minimise negative ELBO. They refer to negative ELBO as NLL. So far I understand what they are doing. However, the next step confuses me. Once they have their NLL, they divide it by sum of sequence lengths.
times = [time.time()]
for epoch in range(args.num_epochs):
# accumulator for our estimate of the negative log likelihood
# (or rather -elbo) for this epoch
epoch_nll = 0.0
# prepare mini-batch subsampling indices for this epoch
shuffled_indices = np.arange(N_train_data)
np.random.shuffle(shuffled_indices)
# process each mini-batch; this is where we take gradient steps
for which_mini_batch in range(N_mini_batches):
epoch_nll += process_minibatch(epoch, which_mini_batch, shuffled_indices)
# report training diagnostics
times.append(time.time())
epoch_time = times[-1] - times[-2]
log("[training epoch %04d] %.4f \t\t\t\t(dt = %.3f sec)" %
(epoch, epoch_nll / N_train_time_slices, epoch_time))
And I don't quite understand why they are doing that. Can some explain? Are they averaging here? Insights would be appreciated.