
I'm enrolled in Coursera ML class and I just started learning about neural networks.

One thing that truly mystifies me is how recognizing something so “human”, like a handwritten digit, becomes easy once you find good weights for linear combinations.

It is even crazier when you realize that something seemingly abstract (like a car) can be recognized just by finding good parameters for linear combinations, combining them, and feeding them into each other.

Combinations of linear combinations are much more expressive than I once thought.
This led me to wonder whether it is possible to visualize an NN's decision process, at least in simple cases.

For example, if my input is a 20×20 greyscale image (i.e. 400 features in total) and the output is one of 10 classes corresponding to recognized digits, I would love to see some kind of visual explanation of which cascades of linear combinations led the NN to its conclusion.


I naïvely imagine that this could be implemented as a visual cue over the image being recognized, maybe a temperature map showing the “pixels that affected the decision the most”, or anything else that helps to understand how the neural network worked in a particular case.

Is there some neural network demo that does just that?

Dan Abramov
    Assuming you mean multi-layer feedforward networks, those are *not* just linear models. However, you might get a hint of which pixels are most important by calculating the sum of absolute weights connecting any of them to each of the hidden units. – Fred Foo May 29 '12 at 09:51
  • Look into the thetas; these are the weights you're looking for. You can also visualize the hidden layer to see what your network processed. – Thomas Jungblut Jun 22 '12 at 09:09
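The “sum of absolute weights” idea from the comments can be sketched in a few lines of NumPy. The shapes below are borrowed from the Coursera exercise (a `Theta1` mapping a bias unit plus 400 pixels to 25 hidden units); the weights here are random placeholders, and in practice you would substitute your trained matrix:

```python
import numpy as np

# Hypothetical first-layer weight matrix: rows are hidden units,
# columns are the bias term plus the 400 input pixels.
# Random values stand in for a trained Theta1.
rng = np.random.default_rng(0)
Theta1 = rng.standard_normal((25, 401))

# Drop the bias column, then sum |weight| over all hidden units:
# a large value means the pixel feeds strongly into at least one hidden unit.
importance = np.abs(Theta1[:, 1:]).sum(axis=0)   # shape (400,)

heat_map = importance.reshape(20, 20)            # back to image layout
print(heat_map.shape)                            # (20, 20)
```

Displaying `heat_map` over the input (e.g. with matplotlib's `imshow`) gives roughly the temperature map described in the question, though it shows which pixels the network can react to overall, not which pixels drove one particular decision.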

2 Answers


This is not a direct answer to your question, but I would suggest you take a look at convolutional neural networks (CNNs). In CNNs you can almost see the concept that is learned. You should read this publication:

Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998

CNNs are often called "trainable feature extractors". In fact, CNNs implement 2D filters with trainable coefficients. This is why the activations of the first layers are usually shown as 2D images (see Fig. 13). In this paper the authors use another trick to make the networks even more transparent: the last layer is a radial basis function layer (with Gaussian functions), i.e. the distance to an (adjustable) prototype for each class is calculated. You can really see the learned concepts by looking at the parameters of the last layer (see Fig. 3).

CNNs are still artificial neural networks, but the layers are not fully connected and some neurons share the same weights.
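To make the "trainable 2D filter" idea concrete, here is a minimal sketch of what one first-layer unit computes. The kernel below is a hand-written edge detector rather than a learned one, but a trained CNN's kernels and feature maps are displayed in exactly the same way:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a small 2D kernel over the image (cross-correlation, 'valid' mode)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy 20x20 input: left half dark, right half bright.
image = np.zeros((20, 20))
image[:, 10:] = 1.0

# Hand-picked horizontal-edge filter (Sobel), standing in for a learned kernel.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], float)

feature_map = conv2d_valid(image, sobel_x)
print(feature_map.shape)                 # (18, 18)
print(feature_map.max())                 # 4.0, at the vertical edge
```

The resulting `feature_map` is itself a 2D image, which is why first-layer CNN activations can simply be plotted and inspected.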

alfa

Maybe it doesn't answer the question directly, but I found this interesting piece in this paper by Andrew Ng, Jeff Dean, Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen and Greg Corrado (emphasis mine):

In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus

...

These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown [below], confirm that the tested neuron indeed learns the concept of faces.

[figure: the two visualizations of the face neuron's optimal stimulus]

In other words, they take the neuron that performs best at recognizing faces and

  • select the images from the dataset that cause it to output the highest confidence;
  • mathematically find an image (not in the dataset) that would get the highest confidence.

It's fun to see that it actually “captures” features of the human face.
The learning is unsupervised, i.e. the input data didn't say whether an image was a face or not.
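The second technique (numerical optimization of the stimulus) can be sketched with a toy single "neuron". Everything below is made up for illustration: the neuron is just a linear unit with random placeholder weights, whereas in the paper the gradient would come from backpropagation through the whole network:

```python
import numpy as np

# Toy version of "find the optimal stimulus by numerical optimization":
# maximize a single linear neuron's response (w @ x) over unit-norm inputs x.
rng = np.random.default_rng(0)
w = rng.standard_normal(400)              # hypothetical neuron weights

x = rng.standard_normal(400)              # start from a random "image"
x /= np.linalg.norm(x)
for _ in range(100):
    x += 0.1 * w                          # gradient of (w @ x) w.r.t. x is w
    x /= np.linalg.norm(x)                # project back onto the unit sphere

# For a linear unit the optimal unit-norm stimulus is w / ||w||,
# and gradient ascent recovers exactly that direction.
w_hat = w / np.linalg.norm(w)
print(round(float(x @ w_hat), 3))         # 1.0
```

Reshaping the optimized `x` to 20×20 and plotting it is the toy analogue of the "optimal input" images shown in the paper; the norm constraint here plays the role of their bounded-input condition.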

Interestingly, here are generated “optimal input” images for cat heads and human bodies:

[figure: generated “optimal input” images for cat heads and human bodies]

Dan Abramov