
I am trying to understand the use of the TimeDistributed layer in Keras/TensorFlow. I have read some threads and articles, but I still haven't properly understood it.

The threads that gave me some understanding of what the TimeDistributed layer does are:

What is the role of TimeDistributed layer in Keras?

TimeDistributed(Dense) vs Dense in Keras - Same number of parameters

But I still don't know why the layer is actually used!

For example, both of the code snippets below will produce the same output (and output_shape):

from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed

model = Sequential()
model.add(TimeDistributed(LSTM(5, input_shape = (10, 20), return_sequences = True)))
print(model.output_shape)

model = Sequential()
model.add(LSTM(5, input_shape = (10, 20), return_sequences = True))
print(model.output_shape)

And the output shape (as far as I know) will be:

(None, 10, 5)

So, if both models produce the same output, what is the TimeDistributed layer actually for?

I also have one other question. The TimeDistributed layer applies the same layer (with shared weights) to every time step of the data. So how is it different from unrolling the LSTM layer, which the Keras API describes as:

unroll: Boolean (default False). If True, the network will be unrolled, else a symbolic loop will be used. Unrolling can speed-up a RNN, although it tends to be more memory-intensive. Unrolling is only suitable for short sequences.
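
From what I can tell, unroll is just a flag on the LSTM itself; a minimal sketch, using the same shapes as above:

from keras.models import Sequential
from keras.layers import LSTM

# Same LSTM as before, but the recurrence is unrolled into 10 copies
# (sharing the same weights) instead of a symbolic loop:
model = Sequential()
model.add(LSTM(5, input_shape = (10, 20), return_sequences = True, unroll = True))
print(model.output_shape)  # (None, 10, 5) -- same as with unroll = False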

What is the difference between these two?

Thank you. I am still a newbie, so I have many questions.

Kadam Parikh

1 Answer


As the Keras documentation says, TimeDistributed is a wrapper that applies a layer to every temporal slice of an input.
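
For instance, here is a minimal sketch (shapes chosen arbitrarily) of one Dense layer being applied, with the same weights, to each of 10 time steps:

from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

# A single Dense(8) instance is shared across all 10 temporal slices
# of the (10, 16) input:
model = Sequential()
model.add(TimeDistributed(Dense(8), input_shape = (10, 16)))
print(model.output_shape)  # (None, 10, 8)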

Here is an example which might help:

Let's say you have video samples of cats, and your task is a simple video classification problem: return 0 if the cat is not moving, or 1 if it is. Let's assume your input dim is (None, 50, 25, 25, 3), which means you have 50 time steps (frames) per sample, and each frame is 25 by 25 with 3 channels (RGB).

Well, one approach would be to extract some "features" from each frame using a CNN layer such as Conv2D, and then pass them to an LSTM layer. But the feature extraction should be the same for each frame: a single CNN with shared weights applied at every time step. This is where TimeDistributed comes to the rescue. You can wrap your Conv2D with it, then pass the output through a Flatten layer that is also wrapped in TimeDistributed. So after applying TimeDistributed(Conv2D(...)), the output would have a dim like (None, 50, 5, 5, 16), and after TimeDistributed(Flatten()), the output would have dim (None, 50, 400). (The actual dims depend on the Conv2D parameters.)

The output at this layer can now be passed to an LSTM.

So obviously, LSTM itself does not need a TimeDistributed wrapper.
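
Putting it together, a minimal sketch of the whole model (the layer sizes are illustrative, picked so the shapes match the ones above):

from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten, LSTM, MaxPooling2D, TimeDistributed

model = Sequential()
# The same Conv2D (shared weights) is applied to each of the 50 frames:
model.add(TimeDistributed(Conv2D(16, (3, 3), activation = 'relu'),
                          input_shape = (50, 25, 25, 3)))
model.add(TimeDistributed(MaxPooling2D((4, 4))))  # -> (None, 50, 5, 5, 16)
model.add(TimeDistributed(Flatten()))             # -> (None, 50, 400)
model.add(LSTM(32))                               # the LSTM handles the time axis itself
model.add(Dense(1, activation = 'sigmoid'))       # 1 = moving, 0 = not moving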

SaTa
  • AWESOME explanation, you just made me understand what TimeDistributed is useful for. Kudos! – Asynchronousx Jan 30 '20 at 10:38
  • @SaTa can you explain why the feature extraction would be the same for each frame? – AlwaysNull Apr 29 '20 at 18:46
  • @AlwaysNull that's how I have seen it happening most of the time: having a fixed CNN network across time. Do you mean why not have a different architecture at each time step? – SaTa Apr 29 '20 at 20:28
  • @AlwaysNull because the 'cat' doesn't transform into a 'desk' from one frame to the next. Only variations of the cat's pose need to be inferred by the network behind the LSTM layer. – Unknown Jul 06 '20 at 17:14
  • @Unknown, "feature extraction would be the same for each frame" not "mostly the same" because there is a single CNN that is getting trained. Thus the same features are extracted at each time step. The won't have the same value though, but they are the same features. One simple example is that CNN learns to return the mean and max of pixel values as two feature. These feature would stay the same functions for all the frames, but have different value depending on the pixels at each time frame. – SaTa Jul 07 '20 at 21:32
  • Your previous defense was "that's how I have seen it happening most of the time", lol. I cleared that up for you. You should be thankful, instead of protecting your fragile ego like this. – Unknown Jul 08 '20 at 18:11
  • @Unknown, sorry I am not seeing what you are talking about. SO showed me an edit that "mostly" was added to the answer, which is not true. There is no mostly here. CNN is the same for each time step. So I wrote the above comment. If you have an edit without "mostly", feel free to propose it. This is a technical forum, if you have thoughts with proof, please communicate and avoid this kind of language. – SaTa Jul 08 '20 at 23:31
  • I see now that you are referring to my comment above. My comment that "that's how I have seen it happening most of the time" refers to the architecture of CNN+LSTM. You proposed adding "mostly" to the feature extraction, which again is not true. To be clear, with this architecture that I have seen "most" of the time, the feature extraction is the same at each time step. I hope this clears things up. Feel free to propose edits to reflect this if needed. – SaTa Jul 08 '20 at 23:39
  • @SaTa by the same feature extraction for each time step, do you mean to say that the filters would be the same across all of those points? – Vishal Balaji Jul 22 '22 at 10:24
  • @VishalBalaji, yes. – SaTa Jul 23 '22 at 00:04