I was reading the DeepMind Nature paper on DQN. I understood almost everything in it except one point. I don't know why no one has asked this question before, but it seems a little odd to me.
My question: the input to the DQN is an 84*84*4 image. The first convolutional layer consists of 32 filters of size 8*8 with stride 4. What exactly is the result of this convolution? The input is 3D, but the 32 filters are all described as 2D. How does the third dimension (which corresponds to the last 4 frames of the game) take part in the convolution?
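To make the shapes concrete, here is a minimal pure-Python sketch of one possible reading of this layer. It assumes (this is the usual convention in deep learning libraries, not something I can confirm from the paper's wording) that each of the 32 filters implicitly spans all 4 frames, so it is really 8*8*4, and that the products are summed over the frames as well as over the 8*8 window. The function name `conv2d_multi_channel` and the random inputs are made up for illustration; bias and the nonlinearity are left out.

```python
import random


def conv2d_multi_channel(frames, filters, stride):
    """Valid (no padding) convolution where every filter spans all channels.

    frames:  nested list indexed [channel][row][col]
    filters: nested list indexed [filter][channel][row][col]
    Returns a nested list indexed [filter][out_row][out_col].
    """
    channels = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    k = len(filters[0][0])  # square kernel size
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1

    output = []
    for filt in filters:
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                total = 0.0
                # Assumption: sum over ALL channels (the 4 stacked frames),
                # not just over the 8*8 spatial window.
                for c in range(channels):
                    for di in range(k):
                        row = frames[c][i * stride + di]
                        frow = filt[c][di]
                        for dj in range(k):
                            total += row[j * stride + dj] * frow[dj]
                fmap[i][j] = total
        output.append(fmap)
    return output


random.seed(0)
C, H, W = 4, 84, 84          # 4 stacked frames of 84*84
F, K, STRIDE = 32, 8, 4      # 32 filters, 8*8 window, stride 4

frames = [[[random.random() for _ in range(W)] for _ in range(H)]
          for _ in range(C)]
filters = [[[[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)]
            for _ in range(C)] for _ in range(F)]

out = conv2d_multi_channel(frames, filters, STRIDE)
print(len(out), len(out[0]), len(out[0][0]))  # 32 20 20
```

Under that assumption each filter produces a single 20*20 map, since (84 - 8) / 4 + 1 = 20, so the layer's output would be 20*20*32 and the frame dimension would no longer exist as a separate axis after the first layer.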
Any ideas? Thanks, Amin