

Thanks for reading this post!

Quick question for RNN enthusiasts here:

I know that in backpropagation through time (BPTT), there are at least 3 steps:

For each element in the sequence:
Step 1 - Compute the 'error ratio' of each neuron, from the upper layer down to the lower layer.
Step 2 - Compute a 'weight delta' for each weight (X) using the error ratio from Step 1, and push it into an array.

After the sequence is finished:
Step 3 - Sum all the weight deltas of weight (X) and add the result to the current value of weight (X).
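
Here is roughly how I picture those three steps, for the recurrent weights of a plain tanh RNN (a minimal numpy sketch; the variable names are mine, not standard terminology):

```python
import numpy as np

def bptt_recurrent_sketch(hs, output_errors, W_h, lr=0.01):
    """Sketch of the three steps above, recurrent weights only.
    hs[t] is the tanh hidden state at step t, output_errors[t] is
    dL/dh_t coming from the output layer."""
    T = len(hs)
    weight_deltas = []                    # Step 2 pushes one entry per step
    delta_from_future = np.zeros_like(hs[0])
    for t in reversed(range(T)):          # walk the sequence backwards
        # Step 1: the 'error ratio' (delta) of each hidden neuron at step t,
        # combining the output error with the error flowing back from t+1
        delta = (output_errors[t] + W_h.T @ delta_from_future) * (1.0 - hs[t] ** 2)
        # Step 2: the 'weight delta' for the recurrent weights at this step
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[0])
        weight_deltas.append(np.outer(delta, h_prev))
        delta_from_future = delta
    # Step 3: sum all the weight deltas and fold them into the current weights
    return W_h - lr * sum(weight_deltas)
```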

I am now trying to implement a clockwork RNN (CW-RNN), following the paper found here: http://jmlr.org/proceedings/papers/v32/koutnik14.pdf

From what I understand, each 'module' in the hidden layer has the same number of neurons, just a different clock period.

The forward pass of a CW RNN seems pretty easy and intuitive.
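Here is how I picture it (a minimal numpy sketch with my own names; I have left out the paper's block-upper-triangular constraint on the recurrent matrix for brevity):

```python
import numpy as np

def cw_rnn_forward_sketch(x_seq, W_in, W_h, clocks, f=np.tanh):
    """clocks[g] is the clock period T_g of module g; each module owns a
    contiguous block of hidden units. A module updates only when
    t % T_g == 0; otherwise its activations are copied from step t-1."""
    n_hidden = W_h.shape[0]
    block = n_hidden // len(clocks)
    h = np.zeros(n_hidden)
    states = []
    for t in range(1, len(x_seq) + 1):
        pre = W_in @ x_seq[t - 1] + W_h @ h   # full pre-activation
        h_new = h.copy()                      # default: copy previous state
        for g, T_g in enumerate(clocks):
            if t % T_g == 0:                  # module g is active at step t
                sl = slice(g * block, (g + 1) * block)
                h_new[sl] = f(pre[sl])
        h = h_new
        states.append(h)
    return states
```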
As for the backward pass, however, that's a different story.

Quoting the paper:

The backward pass of the error propagation is similar to SRN as well. The only difference is that the error propagates only from modules that were executed at time step t. The error of non-activated modules gets copied back in time (similarly to copying the activations of nodes not activated at the time step t during the corresponding forward pass), where it is added to the back-propagated error.

This is where I get confused.

Which of the above backpropagation steps are applied to a non-activated module in the hidden layer?
(A module whose clock period T_i satisfies t MOD T_i != 0 at the current time step t.)

Step 1, Step 2, or both?

Thanks again for your help!

1 Answer


I'm not sure about your BPTT algorithm (if you can provide a reference, I might try to understand it better).

But after a closer look at Figure 2 and equations (1) and (2), non-activated modules should just pass the gradient down through time. That means not computing a gradient for a non-active module, and simply carrying the gradient value from time step t back to time step t-1 unchanged.

So I would guess neither Step 1 nor Step 2: the error value just gets copied back to the previous time step.
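
To make that concrete, here is a rough numpy sketch of the backward sweep as I picture it (recurrent weights only; all names are mine, not from the paper):

```python
import numpy as np

def cw_rnn_backward_sketch(hs, upstream, clocks, W_h):
    """hs[t-1] is the hidden state h_t from the forward pass (tanh units),
    upstream[t-1] is dL/dh_t coming from the output layer at step t."""
    T = len(hs)
    n_hidden = W_h.shape[0]
    block = n_hidden // len(clocks)
    grad_W_h = np.zeros_like(W_h)
    delta = np.zeros(n_hidden)            # error arriving from step t+1
    for t in reversed(range(1, T + 1)):
        delta = delta + upstream[t - 1]   # add this step's output error
        back = np.zeros(n_hidden)         # error to hand to step t-1
        for g, T_g in enumerate(clocks):
            sl = slice(g * block, (g + 1) * block)
            if t % T_g == 0:
                # active module: the usual SRN-style step (your Step 1 + Step 2)
                d = delta[sl] * (1.0 - hs[t - 1][sl] ** 2)    # tanh'
                h_prev = hs[t - 2] if t > 1 else np.zeros(n_hidden)
                grad_W_h[sl, :] += np.outer(d, h_prev)        # weight delta
                back += W_h[sl, :].T @ d                      # propagate error
            else:
                # non-active module: no gradient computed here; its error is
                # simply copied back in time, as the paper describes
                back[sl] += delta[sl]
        delta = back
    return grad_W_h
```

Note that the rows of grad_W_h belonging to a non-active module get no contribution at that time step, i.e. no weight delta for that module until it fires again.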

Benjamin Crouzier
  • Perfect, thanks! So if I understand correctly: when a module is active, compute its error ratio in backprop (Step 1), and this error ratio stays the same during the other time steps until the module is activated again? And no weight-delta computation if the module has not been activated? P.S. BPTT link: https://en.m.wikipedia.org/wiki/Backpropagation_through_time – Charles-Ugo Brouillard Apr 06 '17 at 22:41