Convolutional layers basically encode the intuition of "location invariance": the idea that we expect detection of certain "features" ("things" such as edges, corners, circles, noses, faces, etc.) to work in roughly the same way regardless of "where" they are (typically in a 2D space, but theoretically it could also be some other kind of space). This intuition is implemented by having "filters" or "feature detectors" that "slide" along that space.
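As a rough illustrative sketch of that "sliding" idea (plain NumPy, written out with explicit loops for clarity rather than the way a deep learning framework would actually implement it), the same small set of filter weights gets applied at every location:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over every valid position of `image` and record
    the filter response at each location ('valid' convolution, no padding)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same weights are applied at every (i, j): this is what
            # gives the layer its location invariance.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A simple vertical-edge detector, applied everywhere in the image.
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])
image = np.random.rand(8, 8)
responses = convolve2d(image, edge_kernel)
print(responses.shape)  # (6, 6): one response per filter position
```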
Let's say we have: state representation 1: a list of (object label, position, velocity) tuples, one per object in the environment.
In this case, the intuition described above does not apply. The input is not a "space" in which we expect to be able to detect similar "shapes" at different locations, so a convolutional layer would likely perform poorly here.
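For a representation like this, a plain fully connected network is a more natural fit. Here is a minimal sketch (using PyTorch and made-up sizes, purely for illustration), where the flat feature vector goes straight into linear layers:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 5 objects, each described by
# (one-hot object label of size 4, x, y, vx, vy) -> 8 features per object.
num_objects, features_per_object = 5, 8
state_dim = num_objects * features_per_object
num_actions = 4

# No spatial grid structure in the input, so no convolutional layers:
# just fully connected layers over the flat state vector.
net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),
)

state = torch.randn(1, state_dim)  # a dummy flat state vector
q_values = net(state)
print(q_values.shape)  # torch.Size([1, 4])
```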
State representation 2: a tile-based / gridworld-style game. We have a 2D grid of numbers, one per cell, describing the object type in that cell (1 = apple, 2 = dog, 3 = agent, etc.). We flatten this grid and pass it in as the state to our RL algorithm.
With the 2D grid representation, the intuition encoded by convolutional layers can make sense; for example, they could detect useful patterns such as dogs being adjacent to, or surrounded by, apples. However, there are two caveats.

First, you wouldn't want to flatten the grid yourself. Just pass the entire 2D grid into whatever framework you're using to implement convolutional layers: it might do some flattening internally, but the original, unflattened dimensions are central to how convolutional layers work.

Second, encoding categorical variables as the numbers 1, 2, 3, etc. also doesn't tend to work well with neural networks. A one-hot encoding (with one channel per object type, in the case of convolutional layers) would work better. Just as coloured images typically consist of multiple 2D grids (one for Red, another for Green, and another for Blue in the case of RGB images), you'd want one full grid per object type.
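A minimal sketch of what that could look like (again PyTorch, with a made-up 10x10 grid, four object types, and four actions, none of which come from the question itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical gridworld: 10x10 grid of object types
# (0 = empty, 1 = apple, 2 = dog, 3 = agent).
num_object_types = 4
grid = torch.randint(0, num_object_types, (10, 10))

# One-hot encode: one full 2D channel per object type, analogous to the
# R/G/B channels of a colour image. Shape: (batch, channels, height, width).
one_hot = F.one_hot(grid, num_classes=num_object_types)  # (10, 10, 4)
state = one_hot.permute(2, 0, 1).unsqueeze(0).float()    # (1, 4, 10, 10)

# Pass the unflattened grid into a small convolutional network;
# flattening only happens after the convolutional layers.
conv_net = nn.Sequential(
    nn.Conv2d(num_object_types, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 10 * 10, 4),  # e.g. 4 actions
)

q_values = conv_net(state)
print(q_values.shape)  # torch.Size([1, 4])
```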