I'm implementing a LinkNet-based encoder-decoder structure for semantic segmentation on a custom dataset, and I'm trying to introduce ConvLSTM layers between the encoder and the decoder. The encoder produces a 4-D tensor of shape (batch_size, channels, height, width), while the ConvLSTM layers expect a 5-D input of shape (batch_size, sequence_length, channels, height, width). How do I convert this 4-D tensor to a 5-D one without any loss of information? I initially thought of splitting the batch dimension to accommodate the sequence_length, but that seems problematic because I'm dealing with video frames, and the items in a batch aren't necessarily consecutive frames.
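To make that concrete, this is the kind of reshape I had in mind; the shapes and the encoder output below are just placeholders for the sake of the example:

```python
import torch

batch_size, channels, height, width = 8, 512, 32, 64
seq_len = 4

# Stand-in for the 4-D encoder output: (batch_size, channels, height, width)
encoder_out = torch.randn(batch_size, channels, height, width)

# Naive idea: reinterpret groups of consecutive batch items as one sequence,
# i.e. (8, 512, 32, 64) -> (2, 4, 512, 32, 64).
# This only makes sense if items 0..3 (and 4..7) really are consecutive frames.
as_sequences = encoder_out.view(batch_size // seq_len, seq_len, channels, height, width)
print(as_sequences.shape)  # torch.Size([2, 4, 512, 32, 64])
```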
I'm probably looking at training on sequences of four or five frames, i.e. the semantic segmentation map of frame t would be predicted using the information from the previous three or four frames, so a sequence_length of 4 or 5 would do.
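By a sequence I mean something like a sliding window over the video, roughly along these lines (the frame loading and dataset details here are hypothetical placeholders, not my actual pipeline):

```python
import torch
from torch.utils.data import Dataset

class FrameSequenceDataset(Dataset):
    """Hypothetical dataset returning a window of consecutive frames ending at frame t."""
    def __init__(self, frames, masks, seq_len=4):
        self.frames = frames      # e.g. list of (3, H, W) frame tensors, in temporal order
        self.masks = masks        # per-frame segmentation masks
        self.seq_len = seq_len

    def __len__(self):
        return len(self.frames) - self.seq_len + 1

    def __getitem__(self, idx):
        # Input: frames t-3..t stacked into (seq_len, 3, H, W); target: mask of frame t.
        window = torch.stack(self.frames[idx:idx + self.seq_len], dim=0)
        target = self.masks[idx + self.seq_len - 1]
        return window, target
```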
How do I introduce the sequence length? Should it happen during pre-processing (building the input batches as frame sequences, like the dataset sketch above) or right after the encoder?
Most importantly, how do I actually implement it?
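For what it's worth, this is the rough shape of the "sequences from pre-processing" option I'm picturing, with the LinkNet encoder and the ConvLSTM stubbed out as placeholders; I'm not sure whether this is the right way to go about it:

```python
import torch
import torch.nn as nn

# Placeholder for the LinkNet encoder (in the real model this is a ResNet-style backbone).
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
)

batch_size, seq_len = 2, 4
# (batch_size, sequence_length, channels, height, width) straight from the dataloader
frames = torch.randn(batch_size, seq_len, 3, 256, 512)

# Merge batch and time so the 2-D encoder can process every frame in one pass ...
b, t, c, h, w = frames.shape
flat = frames.view(b * t, c, h, w)                  # (B*T, 3, H, W)
features = encoder(flat)                            # (B*T, 64, H/2, W/2)

# ... then restore the time dimension, giving the 5-D input the ConvLSTM expects.
features = features.view(b, t, *features.shape[1:])
print(features.shape)  # torch.Size([2, 4, 64, 128, 256])
# `features` would then go through the ConvLSTM and on to the decoder.
```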