How to create train/dev/test sets from a given dataset in sequence models

Question

Assume that we have the following dataset, where each of the columns f1 to f4 holds the input for one (time) step.

f1  f2  f3  f4  target
1   2   3   4     5
2   3   4   5     6
3   4   5   6     7
4   5   6   7     8
5   6   7   8     9

Each sample consists of 4 (time) steps, and the model gives a single number as output (the target). In the very first sample, the step 1 input is 1, the step 2 input is 2, the step 3 input is 3, and the step 4 input is 4. We will train a sequence model (an RNN, an LSTM, or whatever), which should then output "5" for this particular sequence. The logic is the same for the other samples.
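The table above looks like a sliding window over the series 1, 2, ..., 9. That underlying series is an assumption, since only the table itself is given. Under that assumption, a minimal sketch of how such samples could be built is shown here; `make_windows` is a hypothetical helper name, not part of any library.

```python
def make_windows(series, n_steps):
    """Hypothetical helper: pair n_steps consecutive values with the value that follows."""
    samples = []
    for start in range(len(series) - n_steps):
        inputs = series[start:start + n_steps]   # the n_steps input values
        target = series[start + n_steps]         # the next value in the series
        samples.append((inputs, target))
    return samples


# Assumed underlying series; the question only shows the resulting table.
series = list(range(1, 10))  # 1, 2, ..., 9
samples = make_windows(series, n_steps=4)

for inputs, target in samples:
    print(inputs, target)
```

Each printed row corresponds to one row of the table: four step inputs followed by the target.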

I am concerned about how to divide such a dataset into train and dev sets. (Just ignore the test set for the time being.)

Alternative 1: Say that the first 3 samples make the train set and the following 2 samples make the dev set, as illustrated below.

Train set:

f1  f2  f3  f4  target
1   2   3   4     5
2   3   4   5     6
3   4   5   6     7

Dev set:

f1  f2  f3  f4  target
4   5   6   7     8
5   6   7   8     9
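Alternative 1 amounts to cutting the ordered samples at a fixed index. A minimal sketch, with `chronological_split` as a hypothetical helper name and the five samples copied from the table in the question:

```python
# The five samples from the question, kept in their original order.
samples = [
    ([1, 2, 3, 4], 5),
    ([2, 3, 4, 5], 6),
    ([3, 4, 5, 6], 7),
    ([4, 5, 6, 7], 8),
    ([5, 6, 7, 8], 9),
]


def chronological_split(samples, n_train):
    """Hypothetical helper: the first n_train samples train, the rest are dev."""
    return samples[:n_train], samples[n_train:]


train_set, dev_set = chronological_split(samples, n_train=3)

print("train:", train_set)
print("dev:  ", dev_set)
```

No shuffling is involved here, so the dev samples always come after the train samples in the original order.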

My concern is: if you look at the last train set sample ([3, 4, 5, 6], 7) and the first dev set sample ([4, 5, 6, 7], 8), you will see that 3 input values are identical (4, 5, and 6), although they occur at different steps. (There is a similar overlap with the other dev set sample as well.)
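The overlap described here can be checked directly. The sketch below only compares the input values of the last train sample against each dev sample; `shared_steps` is a hypothetical helper name.

```python
def shared_steps(a, b):
    """Hypothetical helper: values of `a` that also occur in `b`, in the order of `a`."""
    return [x for x in a if x in b]


last_train = [3, 4, 5, 6]                      # inputs of the last train sample
dev_inputs = [[4, 5, 6, 7], [5, 6, 7, 8]]      # inputs of the two dev samples

overlaps = [shared_steps(last_train, d) for d in dev_inputs]

for d, shared in zip(dev_inputs, overlaps):
    print(d, "shares", shared, "with", last_train)
```

Note that the shared values sit at different positions in each sequence: the value 4 is the step 2 input in the train sample but the step 1 input in the first dev sample.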

Q1: Is it a problem that some input values are identical? Or can we say that it should not matter because (1) even though the values are identical, they occur at different steps of the sequence, and (2) the target values are still different for each sequence example?

Q2: With respect to the problem above, how should the test set be created?

score 1 · Accepted Answer · answered Jul 04 '19 at 13:57

No, it is not a problem: the shared values are at different time steps, so the sequences are not identical, and they have different targets as well. If you train your model well, it should learn to predict the next value.

Dulmina


Thank you, @Dulmina! What if the targets were the same? Would you still say that it is safe because the identical inputs are used at different time steps? Also, would you be able to elaborate on your answer with some mathematical explanation? It would be very helpful. – edn Jul 04 '19 at 17:12
It depends on your task. If it is okay in your task for 2 sequences to have the same target, then it is fine. But if 2 sequences cannot have the same target in your task and your dataset contains such data, then that data is incorrect, and it will mislead the learning of the model. Also, this answer cannot really be elaborated with math equations. If you want, you can look up how an LSTM works in mathematical detail. :-) – Dulmina Jul 04 '19 at 17:34
