12

I think the title is self-explanatory, but to ask it in detail: sklearn has the method `train_test_split()`, which works like this: `X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)`. This splits the data in a 0.3 : 0.7 ratio (test : train) and tries to keep the percentage of each label equal in both parts. Is there a Keras equivalent of this?

CerushDope
  • 1
    There is no separate method, but you can use the `validation_split` keyword argument of the `fit` function to split the input data. Still, that split is naive and will not try to balance the labels. – Abhai Kollara Feb 01 '18 at 16:01
  • 1
    No, `validation_split` only holds out part of the data for validation during training, i.e. that data is used to evaluate the model while it is still being fit. I don't want that; I just want separate test data that is used only once the model is finished. – CerushDope Feb 01 '18 at 16:05
  • 2
    There is no method, just use the one in scikit-learn. – Dr. Snoopy Feb 01 '18 at 17:30
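The suggestion in the comments can be sketched as follows. This is a minimal example, assuming scikit-learn and NumPy are installed; the toy arrays `X` and `Y` and the `random_state` value are made up for illustration. The resulting NumPy arrays can then be passed to a Keras model's `fit` and `evaluate` methods as usual.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1.
X = np.arange(200, dtype=np.float32).reshape(100, 2)
Y = np.array([0] * 80 + [1] * 20)

# Stratified 70/30 split: both parts keep the 80:20 class ratio.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=0
)

# Fraction of class-1 labels in each part (should match the full data).
train_ratio = Y_train.mean()
test_ratio = Y_test.mean()
```

The test arrays stay untouched until the model is finished, which is the separation the question asks for.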

2 Answers

5

There is now a way to do this, using the TensorFlow `tf.data.Dataset` class. I'm running keras-2.2.4-tf along with the new tensorflow release.

Basically, load all the data into a Dataset using something like tf.data.Dataset.from_tensor_slices. Then split the data into new datasets for training and validation. For example, shuffle all the records in the dataset. Then use all but the first 400 as training and the first 400 as validation.

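# Shuffle once; reshuffle_each_iteration=False keeps the order fixed so the two splits stay disjoint
ds = ds_in.shuffle(buffer_size=rec_count, reshuffle_each_iteration=False)
ds_train = ds.skip(400)     # everything after the first 400 records
ds_validate = ds.take(400)  # the first 400 records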

An instance of the Dataset class is a natural container to pass around for the Keras models. I copied the concept from a tensorflow or keras training example but can't seem to find it again.
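For reference, here is a self-contained sketch of the shuffle/take/skip split described above. It assumes TensorFlow 2.x; the toy arrays, the `seed` value, and the ID check at the end are illustrative additions. Note that this split is random, not stratified, so it does not balance labels the way `stratify=Y` does in scikit-learn.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 1000 records, 4 features each, binary labels.
rec_count = 1000
val_count = 400
features = np.arange(rec_count * 4, dtype=np.float32).reshape(rec_count, 4)
labels = (np.arange(rec_count) % 2).astype(np.int64)

ds_in = tf.data.Dataset.from_tensor_slices((features, labels))

# Shuffle with a fixed order. With the default reshuffling on every pass,
# take() and skip() would see different orders and the splits could overlap.
ds = ds_in.shuffle(buffer_size=rec_count, seed=0, reshuffle_each_iteration=False)

ds_validate = ds.take(val_count)  # first 400 shuffled records
ds_train = ds.skip(val_count)     # remaining 600 shuffled records

# The first feature of each record is unique, so it can serve as a record ID.
val_ids = {int(x[0]) for x, _ in ds_validate.as_numpy_iterator()}
train_ids = {int(x[0]) for x, _ in ds_train.as_numpy_iterator()}
```

Checking that `val_ids` and `train_ids` do not intersect is a cheap way to confirm that no record leaks from one split into the other.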

The canned datasets loaded with the load_data method are returned as numpy.ndarray objects, so they are a little different, but they can easily be converted to a `tf.data.Dataset`. I suspect load_data has not been changed to return a Dataset because so much existing code would break.
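The conversion mentioned above might look like the following sketch. It assumes TensorFlow 2.x and uses random stand-in arrays shaped like MNIST images instead of calling load_data, so nothing is downloaded; the batch size of 32 is an arbitrary choice.

```python
import numpy as np
import tensorflow as tf

# Stand-in arrays shaped like the MNIST output of load_data()
# (assumption: random values, not the real dataset).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(64, 28, 28), dtype=np.uint8)
y_train = rng.integers(0, 10, size=(64,), dtype=np.uint8)

# Wrap the (images, labels) pair in a Dataset and batch it.
ds_train = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(32)

# Pull one batch to inspect its shape.
first_images, first_labels = next(iter(ds_train))
```

A batched Dataset like `ds_train` can be passed directly to a Keras model's `fit` method in place of separate feature and label arrays.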

dturvene
1

Unfortunately, the answer (despite our wishes) is no. There are some built-in datasets, such as MNIST, which can be loaded directly in already-split form:

from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

Loading data that is already split like this can give false hope that a general splitting method exists, but unfortunately there isn't one in Keras. You may, however, be interested in the scikit-learn wrappers for Keras.

There is an almost identical question on Data Science SE.

Failed Scientist