
I'm interested in implementing a LinkNet-based encoder-decoder structure for semantic segmentation on a custom dataset, and I'm trying to introduce ConvLSTM layers between the encoder and decoder. As expected, the output of the encoder is a 4-dim tensor (batch_size, channels, height, width), while the ConvLSTM layers expect a 5-dim input (batch_size, sequence_length, channels, height, width). How do I convert this 4-dim tensor to a 5-dim tensor without any loss of information? I initially thought of splitting the batch_size to accommodate the sequence_length, but that might be a problem since I'm dealing with video frames.
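For reference, the batch-splitting idea would look roughly like this in PyTorch (a sketch with made-up sizes; it only works if consecutive batch entries really are consecutive frames of the same video, which is exactly my worry):

```python
import torch

# Toy encoder output: 8 frames in the batch, meant as 2 clips of 4
# (names and sizes are illustrative, not taken from LinkNet itself)
batch_size, channels, height, width = 8, 256, 32, 32
seq_len = 4

feats = torch.randn(batch_size, channels, height, width)  # 4-dim encoder output

# view() is lossless, but the grouping is only meaningful if the
# batch is ordered frame-by-frame within each video clip
clips = feats.view(batch_size // seq_len, seq_len, channels, height, width)
print(clips.shape)  # torch.Size([2, 4, 256, 32, 32])
```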

Maybe I should use sequences of four or five frames for training, i.e. the semantic segmentation map of frame t would be determined using the information from the previous three to four frames, so a sequence_length of 4 or 5 would do (see the sketch below).
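A minimal sketch of that windowing, done as pre-processing on a stack of frames (the helper and shapes are hypothetical, just to pin down what "sequence_length of 4" would mean here):

```python
import torch

def make_clips(frames, seq_len=4):
    """Group a (T, C, H, W) frame tensor into overlapping clips of
    length seq_len, so each clip targets the map of its last frame."""
    clips = [frames[t - seq_len + 1 : t + 1]
             for t in range(seq_len - 1, frames.shape[0])]
    return torch.stack(clips)  # (T - seq_len + 1, seq_len, C, H, W)

video = torch.randn(10, 3, 256, 256)  # 10 frames, toy values
clips = make_clips(video, seq_len=4)
print(clips.shape)                    # torch.Size([7, 4, 3, 256, 256])
```

Building the clips like this in the data loader is what "introducing the sequence length during pre-processing" would amount to.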

How do I introduce the sequence length: during pre-processing, or right after the encoder?

Most importantly, how do I actually implement it?

1 Answer


You can't. A ConvLSTM expects a sequence, which is exactly the dimension you are missing. LinkNet only takes one image as input, so you can't really use a ConvLSTM inside LinkNet.
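To make the mismatch concrete, here is a minimal PyTorch sketch (the layer and shapes are illustrative, not from any particular LinkNet implementation; note also that torch.nn ships no ConvLSTM, so the 5-dim tensor is shown only as a shape):

```python
import torch
import torch.nn as nn

# A LinkNet-style encoder consumes one image per sample:
encoder = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # stand-in layer
frames = torch.randn(4, 3, 256, 256)  # (batch, channels, H, W)
feats = encoder(frames)               # (4, 64, 128, 128) -- still 4-dim

# A ConvLSTM, by contrast, needs an explicit time axis:
# (batch, sequence_length, channels, H, W) -- a 5-dim tensor
# the encoder never produces on its own
clip = torch.randn(1, 5, 3, 256, 256)
```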
