
I have a question regarding the choice of the training and the test set for a Multilayer Perceptron (MLP) and a Hopfield network.

For example, assume we are given 100 patterns of the digits 0-9 in bitmap format: 10 of them are perfect digits, while the other 90 are distorted. Which of these patterns should be used for the training set and which for the test set? The goal is to classify the digits.

I suppose that for the Hopfield network the perfect digits would be used as the training set, but what about the MLP? One approach I thought of was to take, say, 70 of the distorted digits and use them as the training set, with the corresponding perfect digits as their intended targets. Is this approach correct?

sarotnem

1 Answer


Disclaimer: I have not worked with Hopfield networks before, so I will trust your statements about them; it should not be of great relevance for the answer anyway.
I am also assuming that you want to classify the digits, which is something you don't explicitly state in your question.

As for a proper split: aside from the fact that this little training data is generally not enough to get decent results from an MLP (even for a simple task such as digit classification), in most real-world scenarios you will be unlikely to be able to "pre-label" your training data in terms of quality. You should therefore always assume that the data you are processing is inherently noisy. A good example of this is data augmentation, which is frequently used to enrich the training corpus. The fact that augmentations as simple as

  • added noise
  • minor rotations
  • horizontal/vertical flipping (the latter only makes limited sense for digits, though)

can improve your accuracy goes to show that visual quality and sheer quantity of training data are two very different things. Of course, it is not per se true that quantity alone will solve your problem (although research indicates that using very large amounts of data is at least a good idea).
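
To make the augmentation point concrete, here is a minimal sketch of what such augmentations could look like for bitmap digits; the use of NumPy/SciPy and all of the parameters (noise rate, rotation range) are just my assumptions for illustration:

```python
import numpy as np
from scipy.ndimage import rotate

def augment(digit, rng):
    """Return a few simply augmented variants of a 2D bitmap digit."""
    variants = []

    # Added noise: flip a small fraction of the pixels.
    noisy = digit.copy()
    mask = rng.random(digit.shape) < 0.05
    noisy[mask] = 1 - noisy[mask]
    variants.append(noisy)

    # Minor rotation: a few degrees either way, keeping the original shape.
    angle = rng.uniform(-10, 10)
    variants.append(rotate(digit.astype(float), angle, reshape=False, order=0))

    # Horizontal flip (of limited use for digits, as noted above).
    variants.append(np.fliplr(digit))
    return variants

rng = np.random.default_rng(0)
digit = rng.integers(0, 2, size=(8, 8))  # stand-in for a real bitmap digit
augmented = augment(digit, rng)
```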

Further, what you judge to be a good representation might be very different from the network's perspective (although for digits it might be rather easy to tell). A decent strategy is therefore to simply perform a random sampling for your training/test split.
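
As a rough illustration of such a random split (the 70/30 ratio, the dummy data, and the function name are just my assumptions for this sketch):

```python
import numpy as np

def random_split(patterns, labels, test_fraction=0.3, seed=0):
    """Shuffle the indices once and cut them into a train and a test part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(patterns))
    n_test = int(len(patterns) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (patterns[train_idx], labels[train_idx],
            patterns[test_idx], labels[test_idx])

# 100 dummy 8x8 bitmaps with digit labels 0-9, matching the question's setup.
patterns = np.random.default_rng(1).integers(0, 2, size=(100, 8, 8))
labels = np.repeat(np.arange(10), 10)
X_train, y_train, X_test, y_test = random_split(patterns, labels)
```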

Something I like to do when preprocessing a dataset is to check, once the split is done, whether every class is roughly evenly represented in each split, so that you don't overfit to an imbalanced sample. Similarly, I would argue that having clean/high-quality images of digits in both your training and test set makes the most sense: you want the network to learn to recognize both a high-quality digit and a sloppily written one, and your test set should then check whether it actually can.
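
A quick balance check could look like this (again just a sketch, assuming integer class labels 0-9 and dummy label arrays standing in for an actual split):

```python
import numpy as np

def class_counts(labels, num_classes=10):
    """Count how often each digit class occurs in a label array."""
    return np.bincount(labels, minlength=num_classes)

# Dummy labels standing in for the train/test labels of an actual split.
rng = np.random.default_rng(2)
y_train = rng.integers(0, 10, size=70)
y_test = rng.integers(0, 10, size=30)

print("train:", class_counts(y_train))
print("test: ", class_counts(y_test))
# If a class is badly under-represented in either split, a stratified split
# (drawing the same fraction from every class) is the usual remedy.
```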

dennlinger
  • The question is on a theoretical level and not about a real-world implementation, so the quantity of the training data is just an example. Also, I've edited my question to specify the goal, which is the classification of the digits. – sarotnem Aug 20 '18 at 12:27
  • For theoretical questions, I would always recommend looking more in the direction of [Stats Exchange](https://stats.stackexchange.com). Although, even from your theoretical standpoint in this specific example, my argument holds: sampling perfect examples into both the test *and* training set would be the way to go. I don't see any reason why you would pass up the chance to train on them, and obviously you also want to leave room for imperfections (so not using all of the perfect examples in the training set might be a good idea). – dennlinger Aug 20 '18 at 12:32