Based on your questions,
1- Decoder layers are not always mirrored version of the encoder layers. You can check Mask R-CNN, YOLO and similar network architecture to check that decoder is consisted by 1-2 layers while encoder is consisted by multiple layers. However, based on my personal experience, I would definitely suggest to implement mirrored networks and supply it with feedbacks from encoder layers
2- You can use Code (h) part to do multiple things, including description (by supplying it to RNN), classification (by appending DNN), localization (by appending localization network) etc.
Encoder layers are just feature extractors and Code(h) contains extracted features. It is up to you to decide what to do with those features.