I'm implementing a LinkNet-based encoder-decoder structure for semantic segmentation on a custom dataset, and I'm trying to introduce ConvLSTM layers between the encoder and the decoder. The encoder produces a 4-D tensor of shape (batch_size, channels, height, width), while the ConvLSTM layers expect a 5-D input of shape (batch_size, sequence_length, channels, height, width). How do I convert this 4-D tensor to a 5-D one without any loss of information? I initially thought of splitting the batch dimension to accommodate the sequence_length, but that seems problematic because I'm dealing with video frames, and the items in a batch aren't necessarily consecutive frames.
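To make that concrete, this is the kind of reshape I had in mind; the shapes and the encoder output below are just placeholders for the sake of the example:

```python
import torch

batch_size, channels, height, width = 8, 512, 32, 64
seq_len = 4

# Stand-in for the 4-D encoder output: (batch_size, channels, height, width)
encoder_out = torch.randn(batch_size, channels, height, width)

# Naive idea: reinterpret groups of consecutive batch items as one sequence,
# i.e. (8, 512, 32, 64) -> (2, 4, 512, 32, 64).
# This only makes sense if items 0..3 (and 4..7) really are consecutive frames.
as_sequences = encoder_out.view(batch_size // seq_len, seq_len, channels, height, width)
print(as_sequences.shape)  # torch.Size([2, 4, 512, 32, 64])
```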
I'm probably looking at training on sequences of four or five frames, i.e. the semantic segmentation map of frame t would be predicted using the information from the previous three or four frames, so a sequence_length of 4 or 5 would do.
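By a sequence I mean something like a sliding window over the video, roughly along these lines (the frame loading and dataset details here are hypothetical placeholders, not my actual pipeline):

```python
import torch
from torch.utils.data import Dataset

class FrameSequenceDataset(Dataset):
    """Hypothetical dataset returning a window of consecutive frames ending at frame t."""
    def __init__(self, frames, masks, seq_len=4):
        self.frames = frames      # e.g. list of (3, H, W) frame tensors, in temporal order
        self.masks = masks        # per-frame segmentation masks
        self.seq_len = seq_len

    def __len__(self):
        return len(self.frames) - self.seq_len + 1

    def __getitem__(self, idx):
        # Input: frames t-3..t stacked into (seq_len, 3, H, W); target: mask of frame t.
        window = torch.stack(self.frames[idx:idx + self.seq_len], dim=0)
        target = self.masks[idx + self.seq_len - 1]
        return window, target
```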
How do I introduce the sequence length? Should it happen during pre-processing (building the input batches as frame sequences, like the dataset sketch above) or right after the encoder?
Most importantly, how do I actually implement it?
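For what it's worth, this is the rough shape of the "sequences from pre-processing" option I'm picturing, with the LinkNet encoder and the ConvLSTM stubbed out as placeholders; I'm not sure whether this is the right way to go about it:

```python
import torch
import torch.nn as nn

# Placeholder for the LinkNet encoder (in the real model this is a ResNet-style backbone).
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
)

batch_size, seq_len = 2, 4
# (batch_size, sequence_length, channels, height, width) straight from the dataloader
frames = torch.randn(batch_size, seq_len, 3, 256, 512)

# Merge batch and time so the 2-D encoder can process every frame in one pass ...
b, t, c, h, w = frames.shape
flat = frames.view(b * t, c, h, w)                  # (B*T, 3, H, W)
features = encoder(flat)                            # (B*T, 64, H/2, W/2)

# ... then restore the time dimension, giving the 5-D input the ConvLSTM expects.
features = features.view(b, t, *features.shape[1:])
print(features.shape)  # torch.Size([2, 4, 64, 128, 256])
# `features` would then go through the ConvLSTM and on to the decoder.
```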