
I have 250 human face images, and I am going to train a model with them. For the sake of convenience, what I am going to do is pick the first 10 images and use leave-one-image-out cross-validation, so that each image gets the chance to be the test image. As I understand it, in that case the size of my training set is 9 and the size of my test set is 1. After that I am going to take the next 10 images and use them to train the model as well; in that case, the size of my training set would be 19 and the test set would be 1 (this is repeated 20 times so that every image gets the chance to be in the test set). Likewise, this goes on until I have used all 250 images to train the model.
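For concreteness, here is a rough sketch of what I mean by leave-one-image-out on one batch of 10 images (using scikit-learn only as an illustration; `batch_images`, `batch_labels`, and the `SVC` classifier are just placeholders, not my actual pipeline):

    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    loo = LeaveOneOut()
    scores = []
    for train_idx, test_idx in loo.split(batch_images):  # 10 folds: 9 training images, 1 test image
        model = SVC()
        model.fit(batch_images[train_idx], batch_labels[train_idx])
        scores.append(model.score(batch_images[test_idx], batch_labels[test_idx]))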

What I don't understand is the "validation data set". Am I doing this the wrong way?

There was one answer on Stack Overflow, but it wasn't clear to me. That's why I posted this question.

Nishi

2 Answers


You should split your data into training, validation, and testing sets in a ratio of about 6:2:2. You use the training set to train your model. Comparing results on the training and validation sets gives you information about bias and variance. Finally, the test set shows how well your model predicts. Your model shouldn't see any of your test examples during training.
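For example, a rough sketch of a 60/20/20 split with scikit-learn (the `images` and `labels` arrays here are just stand-ins for your 250 faces and their targets):

    from sklearn.model_selection import train_test_split

    # Hold out 20% as the test set; the model never sees it during training.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        images, labels, test_size=0.2, random_state=42)

    # Split the remaining 80% into 60% train / 20% validation (0.25 of 80% = 20%).
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=42)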

Pavelko
  • You mean to say that I shouldn't use all 250 images for training and testing, like I have described in my question above? Is that what you mean by "your model shouldn't see any of your test examples during training"? – Nishi Jul 31 '14 at 11:48
  • Yes, after you have finished training there must be new examples for testing the model's accuracy. – Pavelko Jul 31 '14 at 15:55

The reason you need a separate validation set is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (the hyper-parameters of the model, as opposed to the parameters, which are the network's weights). You do this tuning by using the performance of the model on the validation data as a feedback signal. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space.
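As a rough illustration only (not a specific library API; `build_model`, the candidate layer sizes, and the data variables are all assumed placeholders), validation-driven tuning looks something like this:

    best_score, best_units = None, None
    for units in [32, 64, 128]:                  # candidate hyper-parameter values
        model = build_model(hidden_units=units)  # hypothetical model builder
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)        # validation performance is the feedback signal
        if best_score is None or score > best_score:
            best_score, best_units = score, units
    # Only after choosing best_units do you evaluate once on the untouched test set.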

Splitting your data into training, validation, and test sets may seem straightforward, but there are a few advanced ways to do it that can come in handy when little data is available, as in your case with just 250 images.

You can look into three classic evaluation recipes (a short K-fold sketch follows the list):

  • simple hold-out validation
  • K-fold validation
  • iterated K-fold validation with shuffling.
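Since you only have 250 images, K-fold is probably the most relevant of the three. A minimal sketch with scikit-learn (the `SVC` classifier and the `images`/`labels` arrays are placeholders for your own model and data):

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.svm import SVC

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(), images, labels, cv=kf)  # one accuracy score per fold
    print(scores.mean(), scores.std())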
Prakhar Agarwal