I've trained a single-layer RBM with 100 hidden units, binary input units, and ReLU activations on the hidden layer. Using a training set of 50k MNIST images, I end up with ~5% RMSE on the 10k-image test set after 500 epochs of full-batch training with momentum and an L1 weight penalty.
Looking at the visualisation below, there are clearly big differences between the hidden units. Some appear to have converged to a well-defined response pattern, while others are indistinguishable from noise.
My question is: how would you interpret this apparent variety, and which techniques might help achieve a more balanced result? Does a situation like this call for more regularization, a lower learning rate, longer training, or something else?
Raw weights of the 100 hidden units, each reshaped to the input image size.
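In case it helps to put numbers on the difference between units, here is a minimal sketch of per-unit diagnostics that could accompany the picture. It is not my actual training code: the names `W`, `b_hid`, `X`, `hidden_unit_stats` and `tile_filters` are hypothetical, the weight matrix is assumed to have shape (visible, hidden), and the hidden activation is taken as a plain noise-free ReLU.

```python
import numpy as np


def hidden_unit_stats(W, b_hid, X):
    """Per-hidden-unit diagnostics for an RBM with ReLU hidden units.

    Assumed (hypothetical) layout:
      W     : (n_visible, n_hidden) weight matrix
      b_hid : (n_hidden,) hidden biases
      X     : (n_samples, n_visible) binary inputs
    """
    pre = X @ W + b_hid            # hidden pre-activations
    act = np.maximum(pre, 0.0)     # noise-free ReLU response
    return {
        # L2 norm of each unit's incoming weights
        "weight_norm": np.linalg.norm(W, axis=0),
        # fraction of samples on which the unit is switched on
        "active_frac": (pre > 0).mean(axis=0),
        # average response over the samples
        "mean_act": act.mean(axis=0),
    }


def tile_filters(W, side=28, grid=10):
    """Arrange the columns of W into a grid x grid mosaic of side x side images."""
    n_visible, n_hidden = W.shape
    if n_visible != side * side or n_hidden > grid * grid:
        raise ValueError("W does not fit the requested mosaic layout")
    mosaic = np.zeros((grid * side, grid * side))
    for k in range(n_hidden):
        r, c = divmod(k, grid)
        mosaic[r * side:(r + 1) * side, c * side:(c + 1) * side] = (
            W[:, k].reshape(side, side)
        )
    return mosaic


if __name__ == "__main__":
    # Synthetic stand-ins only; replace with the trained weights and real data.
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(784, 100))
    b_hid = np.zeros(100)
    X = (rng.random((256, 784)) > 0.5).astype(float)

    stats = hidden_unit_stats(W, b_hid, X)
    # Indices of the five units with the smallest incoming weight norm
    print(np.argsort(stats["weight_norm"])[:5])
```

Sorting the units by `weight_norm` or `active_frac` would show whether the noisy-looking filters are also the ones that rarely or never activate, which might narrow down the possible explanations.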