This question makes most of it pretty clear. There's just one part I don't know the answer to yet... In Fig. 1 of this paper, is the input to the deeper layers the same input (i.e. x[t]), or is it the output from the previous layer?
A really simple way to phrase the question: in Fig. 1 of the paper, does the red line feed into every layer, or does each layer receive the output of the previous layer?
I think the input to every layer at time t is x[t], because if it were the output of the previous layer, and x[t] weren't the same dimension as h[t], then the stacked GRU cells would have to accept inputs of different dimensions depending on their depth (i.e. the first layer would take its hidden state from t-1 plus the input x[t], while every subsequent layer would take its own hidden state from t-1 plus the hidden state coming up from the layer below).
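To make the dimension issue concrete, here's a rough PyTorch sketch (sizes are made up) of what the stacking would have to look like if each deeper layer consumed the previous layer's output rather than x[t]:

```python
import torch
import torch.nn as nn

input_size, hidden_size = 10, 32                # x[t] and h[t] deliberately differ

layer1 = nn.GRUCell(input_size, hidden_size)    # first layer sees the raw x[t]
layer2 = nn.GRUCell(hidden_size, hidden_size)   # second layer sees layer1's output, not x[t]

x_t = torch.randn(1, input_size)
h1 = torch.zeros(1, hidden_size)
h2 = torch.zeros(1, hidden_size)

h1 = layer1(x_t, h1)    # input dimension: input_size
h2 = layer2(h1, h2)     # input dimension: hidden_size
```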
But then again, in one of my classes the TA had a solution that assumed x[t] and h[t] were the same dimension, so for the subsequent layers he simply passed the preceding layer's output as the input... That assumption just doesn't seem like it would generally hold.
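For contrast, the TA-style setup only works because the dimensions match, so every layer can be an identically shaped cell (again, made-up sizes):

```python
import torch
import torch.nn as nn

dim = 32                                          # assume x[t] and h[t] are both size 32
layers = [nn.GRUCell(dim, dim) for _ in range(3)]

x_t = torch.randn(1, dim)
hs = [torch.zeros(1, dim) for _ in layers]

inp = x_t
for i, cell in enumerate(layers):
    hs[i] = cell(inp, hs[i])
    inp = hs[i]                                   # next layer consumes this layer's output
```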
The TensorFlow and PyTorch source code would probably provide a definitive answer?
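If it helps, one quick check I'd try in PyTorch (assuming nn.GRU still exposes its per-layer input weights as weight_ih_l0, weight_ih_l1, ...) is to look at the weight shapes of a multi-layer GRU:

```python
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=32, num_layers=2)

print(gru.weight_ih_l0.shape)   # torch.Size([96, 10]) -> layer 0's input is x[t] (size 10)
print(gru.weight_ih_l1.shape)   # torch.Size([96, 32]) -> layer 1's input is layer 0's hidden state (size 32)
```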