
I am relatively new to the subject and have been doing loads of reading. What I am particularly confused about is how a CNN learns its filters for a particular labeled feature in a training data set.

Is the cost calculated by which outputs should or shouldn't be active on a pixel-by-pixel basis? And if that is the case, how does mapping the activations to the labeled data work after down-sampling?

I apologize for any poor assumptions or general misunderstandings. Again, I am new to this field and would appreciate all feedback.

CuriousOne
  • Welcome to StackOverflow. Please follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](https://stackoverflow.com/help/on-topic), [how to ask](https://stackoverflow.com/help/how-to-ask), and ... [the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. StackOverflow is not a design, coding, research, or tutorial resource. There are many sites that can walk you through the various parts of this process; asking us to summarize is out of scope for Stack Overflow. – Prune Jul 26 '19 at 18:24
  • Thanks for the feedback. Reading through my post now, I still fail to see how I am asking for a summary of anything. I asked a pointed question that should have a pointed answer. If you are aware of existing documentation that answers my question, please share! – CuriousOne Jul 26 '19 at 18:32
  • @Prune I have a strong feeling, given your background, that you could concisely answer my question. All I really want to know is if weights are influenced on a pixel by pixel basis or by features as a whole (however that process would work). – CuriousOne Jul 26 '19 at 19:02
  • Great; then let's try it that way. I'll give an answer that I think *does* fit Stack Overflow, and we'll see whether it answers what's in your head. Give me an hour or two ... – Prune Jul 26 '19 at 19:04

1 Answer


I'll break this up into a few small pieces.

  1. Cost calculation -- cost / error / loss depends only on comparing the final prediction (the last layer's output) to the label (ground truth). This serves as a metric of how right or wrong the prediction is.

  2. Inter-layer structure -- Each input to the prediction is an output of the prior layer. This output has a value; the link between the two has a weight.

  3. Back-prop -- Each weight gets adjusted in proportion to the error and to the signal that flowed through it. A connection that contributed to a correct prediction gets rewarded: its weight is increased in magnitude. Conversely, a connection that pushed for a wrong prediction gets reduced.

  4. Pixel-level control -- To clarify the terminology ... each layer's output is a *feature map*: a grid of float values, each of which is called a "pixel". The filter (also called a kernel) is a small square matrix of weights, and those weights are trained individually. The filter slides across the layer's input feature map, performing a dot-product of the filter with the corresponding square sub-matrix of the input. The output of each dot-product is the value of a single pixel in the next layer's feature map.

  5. When the weight feeding a pixel in layer N is increased, this effectively increases the influence of the layer N-1 filter that produced that input. That filter's weights are, in turn, tuned by the inputs from layer N-2.
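Items 1-3 can be sketched in plain Python. This is a deliberately simplified toy, not a real CNN: a single linear "layer" with a squared-error loss, trained by gradient descent (the input values, label, and learning rate are made up for illustration):

```python
# Sketch of items 1-3 under simplifying assumptions: one linear layer,
# squared-error loss, plain gradient descent.

def loss(prediction, label):
    # Item 1: the cost depends only on comparing the final prediction
    # to the ground-truth label.
    return (prediction - label) ** 2

def train_step(weights, inputs, label, lr=0.1):
    # Item 2: the output is a weighted sum of the prior layer's outputs.
    prediction = sum(w * x for w, x in zip(weights, inputs))
    # Item 3: each weight moves in proportion to the error and to the
    # input that flowed through it -- connections that pushed toward a
    # wrong answer are weakened, helpful ones strengthened.
    error = prediction - label
    return [w - lr * error * x for w, x in zip(weights, inputs)]

weights = [0.5, -0.3]           # made-up starting weights
for _ in range(50):
    weights = train_step(weights, inputs=[1.0, 2.0], label=1.0)
# After repeated updates the prediction approaches the label
# and the loss shrinks toward zero.
```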

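Items 4 and 5 -- the sliding dot-product -- can also be sketched in plain Python. The 2x2 filter values and the tiny input grid below are made up for illustration:

```python
# Sketch of one convolution: the filter slides across the input feature
# map, and each dot-product becomes one pixel of the next layer's map.

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1      # valid (no-padding) output size
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot-product of the filter with the sub-matrix under it.
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

image = [[1, 2, 0],
         [0, 1, 3],
         [4, 1, 0]]
edge_filter = [[1, -1],
               [1, -1]]             # responds to left-right contrast
print(convolve2d(image, edge_filter))  # [[-2, 0], [2, -1]]
```

During training, back-prop adjusts the four values inside `edge_filter` individually, which is how the network "learns its filters."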
Prune