Do you have a code example or paper that refers to something like the following diagram?
I want to know why we stack multiple ResNet blocks, as opposed to multiple convolutional blocks as in more traditional architectures. Any code sample, or a reference to one, would be really helpful.
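To make the comparison concrete, here is a minimal sketch (not from any paper, and using a 1x1-convolution-style linear map as a hypothetical stand-in for a full conv layer) contrasting a plain stack of blocks with a residual stack. The identity shortcut `y = x + F(x)` keeps the signal, and the gradient, flowing through deep stacks, which is the usual motivation for stacking residual blocks:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_block(x, w):
    # A plain "conv-like" transform: a per-position linear map over
    # channels followed by a nonlinearity (minimal illustrative stand-in).
    return relu(x @ w)

def residual_block(x, w):
    # Same transform, but with an identity shortcut: y = x + F(x).
    # The shortcut passes the input (and its gradient) through unchanged.
    return x + plain_block(x, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))  # 8 spatial positions, 64 channels
weights = [rng.standard_normal((64, 64)) * 0.01 for _ in range(20)]

# Plain stack: with small weights the activations shrink layer by layer.
plain = x
for w in weights:
    plain = plain_block(plain, w)

# Residual stack: the identity path keeps the signal alive at depth 20.
res = x
for w in weights:
    res = residual_block(res, w)

print(np.abs(plain).mean())  # collapses toward zero
print(np.abs(res).mean())    # stays on the order of the input
```

The same effect holds in reverse for gradients: in the plain stack they vanish (or explode) multiplicatively with depth, while the residual shortcut always contributes an identity term, which is the argument made in the original ResNet paper (He et al., 2016).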
Also, how can I extend that to something like the following, where each ResNet block contains a self-attention module?
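One common pattern (used, for example, in SAGAN and in diffusion U-Nets) is to follow the residual convolutional sub-block with a residual self-attention sub-block, so the output shape is unchanged and the blocks remain freely stackable. A hedged numpy sketch, with all names and weight shapes illustrative rather than taken from any real API:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head self-attention over the positions of x: (n, c) -> (n, c).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def resnet_block_with_attention(x, w, wq, wk, wv):
    # Residual "conv" step, then a residual attention step.
    # Both keep the (n, c) shape, so blocks of this kind can be stacked.
    h = x + np.maximum(x @ w, 0.0)         # residual conv-like sub-block
    h = h + self_attention(h, wq, wk, wv)  # residual attention sub-block
    return h

rng = np.random.default_rng(1)
n, c = 8, 32
x = rng.standard_normal((n, c))
w  = rng.standard_normal((c, c)) * 0.05
wq = rng.standard_normal((c, c)) * 0.05
wk = rng.standard_normal((c, c)) * 0.05
wv = rng.standard_normal((c, c)) * 0.05

y = resnet_block_with_attention(x, w, wq, wk, wv)
print(y.shape)  # shape is preserved: (8, 32)
```

In a real implementation you would typically add normalization layers before each sub-block and reshape 2D feature maps `(h, w, c)` to `(h*w, c)` before attention, but the structure above, residual conv followed by residual attention, is the core of the pattern.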