
I was reading the DeepMind Nature paper on the DQN network. I understood almost everything about it except one point. I don't know why no one has asked this question before, but it seems a little odd to me anyway.

My question: the input to the DQN is an 84*84*4 image. The first convolutional layer consists of 32 filters of 8*8 with stride 4. I want to know what the result of this convolution phase is, exactly. I mean, the input is 3D, but we have 32 filters which are all 2D. How does the third dimension (which corresponds to the last 4 frames of the game) take part in the convolution?

Any ideas? Thanks, Amin

donamin

1 Answer


You can think of the third dimension (representing the last four frames) as the input channels of the network.

A similar scenario occurs when a network takes a three-channel RGB image as input. Each filter is really 3D: it has one 2D kernel per input channel. You perform the convolution for each channel separately with its kernel and sum the contributions to give a single 2D output feature map. So in the DQN's first layer, each of the 32 filters is actually 8*8*4, and the layer produces 32 feature maps.
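Here is a minimal NumPy sketch of that per-channel summing for a single filter, using the shapes from the DQN's first layer (the function name and random inputs are just for illustration):

```python
import numpy as np

def conv_single_filter(x, w, stride):
    """Convolve an H*W*C input with one h*w*C filter.

    Each channel is convolved with its own 2D slice of the filter,
    and the per-channel results are summed into one 2D feature map.
    """
    H, W, C = x.shape
    h, w_, C_f = w.shape
    assert C == C_f, "filter must have one kernel per input channel"
    out_h = (H - h) // stride + 1
    out_w = (W - w_) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+h, j*stride:j*stride+w_, :]
            out[i, j] = np.sum(patch * w)  # sum over height, width, AND channels
    return out

# DQN first layer: 84*84*4 input, one 8*8*4 filter, stride 4
x = np.random.rand(84, 84, 4)
w = np.random.rand(8, 8, 4)
print(conv_single_filter(x, w, stride=4).shape)  # (20, 20)
```

With all 32 filters, you get a 20*20*32 output volume: each filter collapses the 4 frames into one 20*20 map, since (84 - 8)/4 + 1 = 20.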

The DeepMind guys refer to this paper (What is the Best Multi-Stage Architecture for Object Recognition?) which may provide a better explanation.

John Wakefield