
I am currently working on a sequence model that aims to predict the head orientation of someone watching a VR video for an arbitrary number of frames into the future.

Using an encoder-decoder paradigm, the viewer's previous 100 frames of head orientation and the video's previous 100 frames of data are fed into an encoder to create a context vector. (100 is arbitrary here, but for the sake of writing I will use 100 frames, so the previous sequence's data spans [ T(-100) - T0 ].)

The context is then fed into the decoder, which outputs, frame by frame, a probability distribution over where the viewer will be looking: at T1 on the first iteration of decoding, T2 on the second iteration, and so on.
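To make sure I'm describing the setup clearly, here is a minimal sketch of that decode loop. The networks are replaced with toy stand-ins (the names, feature sizes, and bin count are all my own hypothetical choices, not anything from the paper): the "encoder" just pools the past window into a context vector, and the "decoder" maps the context and the previous prediction to a softmax over discretized orientation bins.

```python
import numpy as np

N_BINS = 36    # e.g. yaw discretized into 10-degree bins (hypothetical)
FEAT_DIM = 8   # per-frame feature size (hypothetical)

def encode(frames):
    """Toy encoder: collapse a (T, FEAT_DIM) window into one context vector."""
    return frames.mean(axis=0)

def decode_step(context, prev_pred):
    """Toy decoder: one iteration -> probability distribution over bins."""
    logits = (np.concatenate([context, prev_pred]).sum()
              * np.arange(N_BINS) * 0.01)
    p = np.exp(logits - logits.max())   # softmax for a valid distribution
    return p / p.sum()

rng = np.random.default_rng(0)
past = rng.normal(size=(100, FEAT_DIM))   # frames T(-100) .. T0
context = encode(past)                    # encoded once, then reused

prev = np.zeros(N_BINS)
predictions = []
for t in range(1, 6):                     # predict T1 .. T5
    p = decode_step(context, prev)
    predictions.append(p)
    prev = p                              # feed the prediction back in
```

Note that in this baseline version the context is computed once and never refreshed, which is exactly the limitation I describe below.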

The context vector therefore has a lot of weight on its shoulders, since it must encode data from the entire previous sequence. So, to help the decoder, at each step of decoding I want to give it access to the video's next 100 frames of data. At time T2, for example, the decoder will be on its second iteration and will have access to the 102nd video frame to use for convolution, saliency, etc.

The way we are thinking of accomplishing this is to have the encoder re-encode the context vector on each iteration of decoding, replacing the encoder's input so that it contains the future video frame data.

So, for example, on the second iteration of decoding, we would re-encode the context but replace the data for T(-100) with the decoded output for T1, and the video frame for T(-100) with the video frame for T2. We would continue iterating through decoding in this way.
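The re-encoding scheme above amounts to sliding the encoder's input window forward by one frame per decoding iteration: the oldest entry is dropped, the newest prediction is appended to the head-orientation window, and the next known video frame is appended to the video window. Here is a toy sketch of that loop (again with stand-in encoder/decoder functions and hypothetical feature sizes of my own choosing; the real model would be learned networks):

```python
import numpy as np

FEAT = 4
rng = np.random.default_rng(1)

# Hypothetical inputs: past head-orientation and video features for
# T(-100)..T0. The future video frames (T1, T2, ...) are already known,
# since the clip itself is fixed -- only the head motion is unknown.
head_window  = rng.normal(size=(100, FEAT))
video_window = rng.normal(size=(100, FEAT))
future_video = rng.normal(size=(50, FEAT))   # frames T1 .. T50

def encode(head, video):
    """Toy encoder: pool both modalities into one context vector."""
    return np.concatenate([head.mean(axis=0), video.mean(axis=0)])

def decode_step(context):
    """Toy decoder: context -> predicted head feature for the next frame."""
    return context[:FEAT] * 0.9 + context[FEAT:] * 0.1

for k in range(5):                        # decode T1 .. T5
    context = encode(head_window, video_window)   # re-encode every step
    pred = decode_step(context)                   # prediction for T(k+1)

    # Slide both windows forward one frame: drop the oldest entry,
    # append the new prediction / the next known video frame.
    head_window  = np.vstack([head_window[1:],  pred])
    video_window = np.vstack([video_window[1:], future_video[k]])
```

The key difference from the baseline is that `encode` is called inside the loop, so each decoding step sees a context built from the freshest window, at the cost of running the encoder once per output frame.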

The thought is that this would provide the decoder with information about the upcoming video, and use its previous predictions to reinforce the consistency of future movement predictions. This idea of re-encoding the context was also presented by Johannes Baptist from the University of Amsterdam in this paper, though I am an undergraduate student who is still new to this field, so the paper was a bit challenging to grasp.

If anybody has any intuition about what re-encoding the context at every step of decoding does to an encoder-decoder model, or can provide any insight or resources to look into further, that would be great. Thank you very much.
