Convolutional layers basically encode the intuition of "location invariance": the idea that we expect detection of certain "features" ("things" such as edges, corners, circles, noses, faces, etc.) to work in roughly the same way regardless of "where" they are (typically in a 2D space, but theoretically it could also be some other kind of space). This intuition is implemented by having "filters" or "feature detectors" that "slide" along that space.
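As a rough illustrative sketch of that "sliding" idea (plain NumPy, written out with explicit loops for clarity rather than the way a deep learning framework would actually implement it), the same small set of filter weights gets applied at every location:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over every valid position of `image` and record
    the filter response at each location ('valid' convolution, no padding)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same weights are applied at every (i, j): this is what
            # gives the layer its location invariance.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A simple vertical-edge detector, applied everywhere in the image.
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])
image = np.random.rand(8, 8)
responses = convolve2d(image, edge_kernel)
print(responses.shape)  # (6, 6): one response per filter position
```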
Let's say we have: state representation 1: a list of (object label, position, velocity) tuples, one per object in the environment.
In this case, the intuition described above does not apply. The input is not a "space" in which we expect to be able to detect similar "shapes" at different locations, so a convolutional layer would likely perform poorly here.
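For a representation like this, a plain fully connected network is a more natural fit. Here is a minimal sketch (using PyTorch and made-up sizes, purely for illustration), where the flat feature vector goes straight into linear layers:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 5 objects, each described by
# (one-hot object label of size 4, x, y, vx, vy) -> 8 features per object.
num_objects, features_per_object = 5, 8
state_dim = num_objects * features_per_object
num_actions = 4

# No spatial grid structure in the input, so no convolutional layers:
# just fully connected layers over the flat state vector.
net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),
)

state = torch.randn(1, state_dim)  # a dummy flat state vector
q_values = net(state)
print(q_values.shape)  # torch.Size([1, 4])
```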
State representation 2: a tile-based / gridworld-style game. We have a 2D grid of numbers, one per cell, describing the object type in that cell (1 = apple, 2 = dog, 3 = agent, etc.). We flatten this grid and pass it in as the state to our RL algorithm.
With the 2D grid representation, the intuition encoded by convolutional layers can make sense; for example, they could detect useful patterns such as dogs being adjacent to, or surrounded by, apples. However, there are two caveats.

First, you wouldn't want to flatten the grid yourself. Just pass the entire 2D grid into whatever framework you're using to implement convolutional layers: it might do some flattening internally, but the original, unflattened dimensions are central to how convolutional layers work.

Second, encoding categorical variables as the numbers 1, 2, 3, etc. also doesn't tend to work well with neural networks. A one-hot encoding (with one channel per object type, in the case of convolutional layers) would work better. Just as coloured images typically consist of multiple 2D grids (one for Red, another for Green, and another for Blue in the case of RGB images), you'd want one full grid per object type.
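A minimal sketch of what that could look like (again PyTorch, with a made-up 10x10 grid, four object types, and four actions, none of which come from the question itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical gridworld: 10x10 grid of object types
# (0 = empty, 1 = apple, 2 = dog, 3 = agent).
num_object_types = 4
grid = torch.randint(0, num_object_types, (10, 10))

# One-hot encode: one full 2D channel per object type, analogous to the
# R/G/B channels of a colour image. Shape: (batch, channels, height, width).
one_hot = F.one_hot(grid, num_classes=num_object_types)  # (10, 10, 4)
state = one_hot.permute(2, 0, 1).unsqueeze(0).float()    # (1, 4, 10, 10)

# Pass the unflattened grid into a small convolutional network;
# flattening only happens after the convolutional layers.
conv_net = nn.Sequential(
    nn.Conv2d(num_object_types, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 10 * 10, 4),  # e.g. 4 actions
)

q_values = conv_net(state)
print(q_values.shape)  # torch.Size([1, 4])
```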