It is no surprise that each of the separate networks performs best on the training set it was trained on. But these prediction error values are misleading, because minimizing the error on the training set alone is an ill-posed goal. Your ultimate goal is to maximize the generalization performance of your model, i.e. how well it performs on new data it has not seen during training. Imagine a network that simply memorizes each of the characters and thus behaves more like a hash table: it would yield 0 errors on the training data but perform badly on any other data.
One way to measure generalization performance is to hold out a fraction (e.g. 10%) of your available data and use it as a test set. You do not use this test set during training, only for measurement.
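In PyBrain this split is a one-liner via `splitWithProportion`. A minimal sketch, assuming your samples live in a `ClassificationDataSet` (the dataset name and dimensions here are placeholders):

```python
from pybrain.datasets import ClassificationDataSet

# Hypothetical dataset: 784 input features (e.g. 28x28 pixels), 1 target
# column holding the class index, 10 classes.
alldata = ClassificationDataSet(784, 1, nb_classes=10)
# ... alldata.addSample(image_vector, [class_index]) for each sample ...

# Hold out 10% of the samples for testing; train only on the rest.
tstdata, trndata = alldata.splitWithProportion(0.10)
```

You then pass `trndata` to the trainer and use `tstdata` only for evaluation.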
Further, you should check the topology of your network. How many hidden layers and how many neurons per hidden layer do you use? Make sure the topology is large enough to capture the complexity of your problem.
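For reference, this is how a topology is specified with PyBrain's `buildNetwork` shortcut; the layer sizes below are placeholders, not a recommendation:

```python
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import SoftmaxLayer

# 784 input neurons, one hidden layer of 300 neurons, 10 output classes.
# A single hidden layer of a few hundred units is a common starting point
# for MNIST-sized problems; tune this for your own task.
net = buildNetwork(784, 300, 10, outclass=SoftmaxLayer)
```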
Also have a look at other techniques to improve the generalization performance of your network, like L1 regularization (subtracting a small fixed amount from the magnitude of each weight after every training step, i.e. moving each weight towards zero by a constant step), L2 regularization (subtracting a small percentage of each weight after every training step) or dropout (randomly turning off hidden units during training and halving the weights once training is finished). Further, you should consider more efficient training algorithms like RProp- or RMSProp rather than plain backpropagation (see Geoffrey Hinton's Coursera course on neural networks). Finally, the MNIST dataset of handwritten digits 0-9 is a good benchmark for testing your setup (you should easily achieve fewer than 300 misclassifications on its test set).
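Neither decay variant is hard to add by hand. A minimal sketch, assuming `net.params` exposes the network's weights as a writable NumPy view (which it does for standard PyBrain networks); the function name and lambda values are my own:

```python
import numpy as np

# Hypothetical decay constants; tune them for your problem.
L1_LAMBDA = 1e-5  # fixed step towards zero (L1)
L2_LAMBDA = 1e-4  # fraction of each weight to subtract (L2)

def decay_weights(net):
    """Apply one L1 + L2 decay step to all weights of a PyBrain net.

    net.params is a flat NumPy view of every weight in the network,
    so in-place updates take effect immediately.
    """
    w = net.params
    w *= (1.0 - L2_LAMBDA)       # L2: shrink by a small percentage
    w -= L1_LAMBDA * np.sign(w)  # L1: subtract a small fixed amount
```

Call `decay_weights(net)` after each training step. If I remember correctly, `BackpropTrainer` also accepts a `weightdecay` argument that implements the L2 variant directly.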
To answer your original question on how to omit certain output neurons: you can create your own layer module. Take the `SoftmaxLayer` as a starting point, and in `_forwardImplementation` zero out the entries of the `outbuf` variable that belong to the classes you want to omit (renormalizing afterwards so the remaining outputs still sum to 1). If you want to use this during training, also set the error signal of those classes to zero before backpropagating it to the previous layer, by overriding `_backwardImplementation`, as in the sketch below.
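A minimal sketch of such a layer (the class name and the `omit` parameter are my own inventions; the `_forwardImplementation`/`_backwardImplementation` hooks are PyBrain's):

```python
from pybrain.structure.modules import SoftmaxLayer

class PartialSoftmaxLayer(SoftmaxLayer):
    """SoftmaxLayer that masks out a configurable set of classes.

    `omit` (a hypothetical parameter, not part of PyBrain) holds the
    indices of the output neurons that should be suppressed.
    """

    def __init__(self, dim, omit=(), name=None):
        SoftmaxLayer.__init__(self, dim, name=name)
        self.omit = list(omit)

    def _forwardImplementation(self, inbuf, outbuf):
        # Compute the ordinary softmax first ...
        SoftmaxLayer._forwardImplementation(self, inbuf, outbuf)
        # ... then zero the omitted classes and renormalize so the
        # remaining outputs still form a probability distribution.
        for i in self.omit:
            outbuf[i] = 0.0
        total = outbuf.sum()
        if total > 0:
            outbuf /= total

    def _backwardImplementation(self, outerr, inerr, outbuf, inbuf):
        # SoftmaxLayer just passes the error through, so it suffices
        # to zero the error signal of the omitted classes here.
        inerr[:] = outerr
        for i in self.omit:
            inerr[i] = 0.0
```

Note that `buildNetwork` instantiates its `outclass` with the layer size only, so to pass `omit` you would assemble the network manually via `FeedForwardNetwork` and `FullConnection`.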
This can be useful e.g. if you have incomplete data and do not want to throw away every sample that contains just one NaN value. But in your case you actually do not need this.