Recently, I've come across some papers on generative recurrent models. All of them attach sub-networks such as a prior, encoder, and decoder to the well-known LSTM cell, composing a new type of RNN cell.
I am curious whether vanishing/exploding gradients occur in these new RNN cells. Is there any problem with this kind of combination?
References:
They all seem to follow the pattern described above.
A Recurrent Latent Variable Model for Sequential Data
Pseudocode
The pseudocode for the recurrent architecture is below:
def new_rnncell_call(x, htm1):
    # prior_net, posterior_net, and decoder_net are each a single layer or an MLP
    q_prior = prior_net(htm1)             # prior step
    q = posterior_net([htm1, x])          # inference step
    z = sample_from(q)                    # reparameterization trick
    target_dist = decoder_net(z)          # generation step
    ht = innerLSTM([z, x], htm1)          # recurrent step
    return [q_prior, q, target_dist], ht
What concerns me are the bare weights that sit outside the well-known LSTM (or GRU, etc.) cell: they take part in BPTT without any gating logic on their activations, unlike the weights inside the LSTM. To me, this does not look like stacked RNN layers, or like additional dense layers applied only to the outputs.
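To make the concern concrete, here is how I read the per-step gradient path in the pseudocode. This is only a sketch under my own assumptions: h_t stands for the full recurrent state (hidden and cell), the sample is written as a mean plus scaled noise (the symbols \mu_\phi, \sigma_\phi, and \epsilon_t are mine, not taken from the papers), and the loss terms attached to q_prior, q, and target_dist at each step are left out.

```latex
\begin{aligned}
z_t &= \mu_\phi(h_{t-1}, x_t) + \sigma_\phi(h_{t-1}, x_t) \odot \epsilon_t,
      \qquad \epsilon_t \sim \mathcal{N}(0, I) \\
h_t &= \mathrm{LSTM}(z_t, x_t, h_{t-1}) \\
\frac{\mathrm{d} h_t}{\mathrm{d} h_{t-1}}
    &= \underbrace{\frac{\partial h_t}{\partial h_{t-1}}}_{\text{direct LSTM path}}
     + \underbrace{\frac{\partial h_t}{\partial z_t}\,
                   \frac{\partial z_t}{\partial h_{t-1}}}_{\text{path through posterior\_net}} \\
\frac{\mathrm{d} h_T}{\mathrm{d} h_0}
    &= \frac{\mathrm{d} h_T}{\mathrm{d} h_{T-1}} \cdots \frac{\mathrm{d} h_1}{\mathrm{d} h_0}
\end{aligned}
```

The second term still enters h_t through the LSTM's input weights and gates, but the factor \partial z_t / \partial h_{t-1} itself has no gating, and that is the part I am unsure about.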
Doesn't this lead to a vanishing/exploding gradient problem?
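One way to probe this empirically, without any autodiff library, might be a finite-difference check on a toy version of the cell. The sketch below is not the model from the paper: it assumes a hidden size of 1, random scalar weights, a fixed posterior standard deviation of 0.1, and helper names (toy_cell, run, sensitivity, PARAM_NAMES) that I made up for illustration. It estimates |d h_T / d h_0| for a few sequence lengths.

```python
import math
import random

# Hypothetical scalar weights: two for the posterior mean, three per LSTM gate.
PARAM_NAMES = ["q_h", "q_x"] + [g + s for g in "ifog" for s in ("_z", "_x", "_h")]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def make_params(rng, names):
    # Assumption: weights drawn uniformly from [-1, 1]; a trained model will differ.
    return {name: rng.uniform(-1.0, 1.0) for name in names}


def toy_cell(x, state, eps, p):
    """One step of a scalar stand-in for new_rnncell_call (hidden size 1)."""
    h_prev, c_prev = state
    # Stand-in for posterior_net([htm1, x]): the ungated path from h_{t-1} into z.
    mu = math.tanh(p["q_h"] * h_prev + p["q_x"] * x)
    z = mu + 0.1 * eps  # reparameterized sample with a fixed std of 0.1

    # Stand-in for innerLSTM([z, x], htm1): standard LSTM gates on scalars.
    def gate(name, squash):
        return squash(p[name + "_z"] * z + p[name + "_x"] * x + p[name + "_h"] * h_prev)

    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run(h0, seq, p):
    """Unroll the toy cell over seq, a list of (input, noise) pairs; return h_T."""
    state = (h0, 0.0)
    for x, eps in seq:
        state = toy_cell(x, state, eps, p)
    return state[0]


def sensitivity(T, seed=0, delta=1e-6):
    """Central finite-difference estimate of |d h_T / d h_0| at h_0 = 0."""
    rng = random.Random(seed)
    p = make_params(rng, PARAM_NAMES)
    # Inputs and noise are fixed per seed so both perturbed runs see the same sequence.
    seq = [(rng.uniform(-1.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(T)]
    return abs(run(delta, seq, p) - run(-delta, seq, p)) / (2.0 * delta)


if __name__ == "__main__":
    for T in (1, 5, 20, 50):
        print(T, sensitivity(T))
```

If the printed values shrink towards zero or blow up as T grows, that hints at vanishing or exploding gradients for that particular random draw. A real check would use the actual model and its loss, and would look at several seeds.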