I have a dataset that includes dates, in sorted order.

I would like to split the data into a train set and a test set. However, the split must ensure that every row in the test set is newer than every row in the train set.

Please look at the given example:

Let's assume that we have data by dates:

1, 2, 3, ..., n.

The numbers from 1 to n represent the days.

I would like to use 20% of the data as the train set and the remaining 80% as the test set.

Good results:

1) train set = 1, 2, 3, ..., 20

   test set = 21, ..., 100

2) train set = 101, 102, ..., 120

   test set = 121, ..., 200

My code:

from sklearn.model_selection import train_test_split

train_size = 0.2
train_dataframe, test_dataframe = train_test_split(features_dataframe, train_size=train_size)

train_dataframe = train_dataframe.sort_values("date")
test_dataframe = test_dataframe.sort_values("date")

This does not work for me: the test set does not end up newer than the train set.

Any suggestions?

Aviade

1 Answer

If you insist that all testing data be newer than all training data, then there is only one way to accomplish the desired 20/80 split.

n = features_dataframe.shape[0]
train_size = 0.2

features_dataframe = features_dataframe.sort_values('date')
train_dataframe = features_dataframe.iloc[:int(n * train_size)]
test_dataframe = features_dataframe.iloc[int(n * train_size):]

And there is nothing random about it.
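
For anyone who wants to try this end to end, here is a minimal sketch of the same idea wrapped in a function. The helper name `chronological_split` and the synthetic `features_dataframe` are made up for illustration; only the `date` column name comes from the question.

```python
import pandas as pd


def chronological_split(df, date_col="date", train_size=0.2):
    """Sort by date, then cut once: the oldest rows become train, the rest test."""
    ordered = df.sort_values(date_col)
    split_at = int(len(ordered) * train_size)  # number of rows in the train set
    return ordered.iloc[:split_at], ordered.iloc[split_at:]


# Hypothetical stand-in for features_dataframe: 100 consecutive days,
# shuffled so that the sort inside the helper actually matters.
features_dataframe = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=100, freq="D"),
    "value": range(100),
}).sample(frac=1, random_state=0)

train_dataframe, test_dataframe = chronological_split(features_dataframe)

# Sanity check: the newest train row must be older than the oldest test row.
assert train_dataframe["date"].max() < test_dataframe["date"].min()
```

With 100 rows and train_size = 0.2 this gives 20 train rows and 80 test rows. If several rows share the same date, rows with that date can land on both sides of the cut, so the strict check above assumes unique dates.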

piRSquared