
I have training datasets xtrain, ytrain, xtest and ytest, all numpy arrays. I want to save them together in a single file, so that I can load them back into the workspace the way Keras does with mnist.load_data:

(xtrain, ytrain), (xtest, ytest) = mnist.load_data(filepath)

In Python, is there a way to save my training datasets into such a single file? Or is there any other appropriate method to save them?

jwm

4 Answers


You have a number of options:

Keras provides an option to save models to HDF5. Also, note that of the formats mentioned, HDF5 is the only interoperable one.
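The single-file, structured approach could look like the sketch below, using h5py (the file and dataset names are my own assumptions, not from the answer). Storing each array as its own named dataset avoids the object-dtype error and lets xtrain and xtest have different shapes:

```python
import numpy as np
import h5py

# Toy stand-ins for the real arrays; shapes need not match across datasets.
xtrain = np.random.rand(60, 28, 28)
ytrain = np.arange(60)
xtest = np.random.rand(10, 28, 28)
ytest = np.arange(10)

# Save all four arrays into one HDF5 file, one dataset each.
with h5py.File("datasets.h5", "w") as f:
    for name, arr in [("xtrain", xtrain), ("ytrain", ytrain),
                      ("xtest", xtest), ("ytest", ytest)]:
        f.create_dataset(name, data=arr, compression="gzip")

# Load them back in one go.
with h5py.File("datasets.h5", "r") as f:
    xtrain2, ytrain2 = f["xtrain"][:], f["ytrain"][:]
    xtest2, ytest2 = f["xtest"][:], f["ytest"][:]
```

Reading with `[:]` materializes each dataset as a plain numpy array; with very large data you could also slice only the portion you need without loading the whole file.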

Lukasz Tracewski
  • I am not going to save models, only training data that will be reused later. The special part here is saving them all together in one file. – jwm Jun 09 '17 at 21:42
  • That doesn't matter; I mentioned it so that you know which library it is using. You can save your training and test data to a single hdf5 file, in a nicely structured fashion. – Lukasz Tracewski Jun 09 '17 at 21:44
  • I tried to save them as a tuple using h5py, which produces the error: TypeError: Object dtype dtype('O') has no native HDF5 equivalent. Maybe it does not support tuples. Do you have any suggestions for structuring the data together (xtrain and xtest differ in dimensions)? Thanks! – jwm Jun 09 '17 at 23:06

Pickle is a good way to go:

import pickle as pkl

# to save it (binary mode is required for pickle)
with open("train.pkl", "wb") as f:
    pkl.dump([train_x, train_y], f)

# to load it
with open("train.pkl", "rb") as f:
    train_x, train_y = pkl.load(f)

If your dataset is huge, I would recommend checking out hdf5, as @Lukasz Tracewski mentioned.

  • Since my edit got rejected: `wb` and `rb` are a must for pickle, see [Using pickle.dump - TypeError: must be str, not bytes](https://stackoverflow.com/a/13906715/11154841). – questionto42 Sep 18 '21 at 10:44
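To mirror the `(xtrain, ytrain), (xtest, ytest) = mnist.load_data(...)` unpacking from the question, a nested tuple can be pickled into one file. A sketch (file name and toy arrays are assumptions):

```python
import pickle as pkl
import numpy as np

# Toy stand-ins for the real arrays.
xtrain, ytrain = np.zeros((6, 4)), np.arange(6)
xtest, ytest = np.ones((2, 4)), np.arange(2)

# Binary mode, plus the highest protocol for a more compact file.
with open("dataset.pkl", "wb") as f:
    pkl.dump(((xtrain, ytrain), (xtest, ytest)), f,
             protocol=pkl.HIGHEST_PROTOCOL)

# Loading then unpacks exactly like mnist.load_data.
with open("dataset.pkl", "rb") as f:
    (xtrain2, ytrain2), (xtest2, ytest2) = pkl.load(f)
```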

I find hickle a very nice way to save them all together into a dict:

import hickle as hkl

# save all four arrays under named keys in one file
data = {'xtrain': xtrain, 'ytrain': ytrain, 'xtest': xtest, 'ytest': ytest}
hkl.dump(data, 'data.hkl')

# load them back
data = hkl.load('data.hkl')
jwm

You could simply use numpy.save:

np.save('xtrain.npy', xtrain)

or, in a human-readable format:

np.savetxt('xtrain.txt', xtrain)
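To get all four arrays into a single compressed file rather than one file per array, `numpy.savez_compressed` is also an option. A sketch (file name and toy arrays are assumptions):

```python
import numpy as np

# Toy stand-ins for the real arrays.
xtrain, ytrain = np.zeros((6, 4)), np.arange(6)
xtest, ytest = np.ones((2, 4)), np.arange(2)

# One .npz archive; arrays are keyed by the keyword-argument names.
np.savez_compressed("data.npz",
                    xtrain=xtrain, ytrain=ytrain,
                    xtest=xtest, ytest=ytest)

# Load and unpack by key.
with np.load("data.npz") as d:
    xtrain2, ytrain2 = d["xtrain"], d["ytrain"]
    xtest2, ytest2 = d["xtest"], d["ytest"]
```

The compression also speaks to the concern in the comment below about saving large datasets economically.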

petezurich
  • My training datasets are quite large in size. I want to save them more economically. – jwm Jun 09 '17 at 21:40