
I have training datasets xtrain, ytrain, xtest and ytest, all numpy arrays. I want to save them together in a single file, so that I can load them back into the workspace the way Keras does with mnist.load_data:

(xtrain, ytrain), (xtest, ytest) = mnist.load_data(filepath)

In Python, is there a way to save my training datasets into such a single file? Or is there any other appropriate method to save them?

jwm

4 Answers


You have a number of options:

Keras provides an option to save models to HDF5. Also, note that of the formats mentioned, HDF5 is the only interoperable one.
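The single-file, structured approach could look like the sketch below, using h5py (the file and dataset names are my own assumptions, not from the answer). Storing each array as its own named dataset avoids the object-dtype error and lets xtrain and xtest have different shapes:

```python
import numpy as np
import h5py

# Toy stand-ins for the real arrays; shapes need not match across datasets.
xtrain = np.random.rand(60, 28, 28)
ytrain = np.arange(60)
xtest = np.random.rand(10, 28, 28)
ytest = np.arange(10)

# Save all four arrays into one HDF5 file, one dataset each.
with h5py.File("datasets.h5", "w") as f:
    for name, arr in [("xtrain", xtrain), ("ytrain", ytrain),
                      ("xtest", xtest), ("ytest", ytest)]:
        f.create_dataset(name, data=arr, compression="gzip")

# Load them back in one go.
with h5py.File("datasets.h5", "r") as f:
    xtrain2, ytrain2 = f["xtrain"][:], f["ytrain"][:]
    xtest2, ytest2 = f["xtest"][:], f["ytest"][:]
```

Reading with `[:]` materializes each dataset as a plain numpy array; with very large data you could also slice only the portion you need without loading the whole file.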

Lukasz Tracewski
  • I am not going to save models, only training data that will be reused later. The special part here is saving them all together in one file. – jwm Jun 09 '17 at 21:42
  • That doesn't matter; I mentioned it so that you know which library it is using. You can save your training and test data to a single hdf5 file, in a nicely structured fashion. – Lukasz Tracewski Jun 09 '17 at 21:44
  • I tried to save them as a tuple using h5py, which produces the error: TypeError: Object dtype dtype('O') has no native HDF5 equivalent. Maybe it does not support tuples. Do you have any suggestions for structuring the data together (xtrain and xtest differ in dimensions)? Thanks! – jwm Jun 09 '17 at 23:06

Pickle is a good way to go:

import pickle as pkl

# to save it (binary mode is required for pickle)
with open("train.pkl", "wb") as f:
    pkl.dump([train_x, train_y], f)

# to load it
with open("train.pkl", "rb") as f:
    train_x, train_y = pkl.load(f)

If your dataset is huge, I would recommend checking out hdf5, as @Lukasz Tracewski mentioned.

  • Since my edit got rejected: `wb` and `rb` are a must for pickle, see [Using pickle.dump - TypeError: must be str, not bytes](https://stackoverflow.com/a/13906715/11154841). – questionto42 Sep 18 '21 at 10:44
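To mirror the `(xtrain, ytrain), (xtest, ytest) = mnist.load_data(...)` unpacking from the question, a nested tuple can be pickled into one file. A sketch (file name and toy arrays are assumptions):

```python
import pickle as pkl
import numpy as np

# Toy stand-ins for the real arrays.
xtrain, ytrain = np.zeros((6, 4)), np.arange(6)
xtest, ytest = np.ones((2, 4)), np.arange(2)

# Binary mode, plus the highest protocol for a more compact file.
with open("dataset.pkl", "wb") as f:
    pkl.dump(((xtrain, ytrain), (xtest, ytest)), f,
             protocol=pkl.HIGHEST_PROTOCOL)

# Loading then unpacks exactly like mnist.load_data.
with open("dataset.pkl", "rb") as f:
    (xtrain2, ytrain2), (xtest2, ytest2) = pkl.load(f)
```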

I find hickle a very nice way to save them all together into a dict:

import hickle as hkl

# save all four arrays under named keys in one file
data = {'xtrain': xtrain, 'ytrain': ytrain, 'xtest': xtest, 'ytest': ytest}
hkl.dump(data, 'data.hkl')

# load them back
data = hkl.load('data.hkl')
jwm

You could simply use numpy.save:

np.save('xtrain.npy', xtrain)

or, in a human-readable format:

np.savetxt('xtrain.txt', xtrain)
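To get all four arrays into a single compressed file rather than one file per array, `numpy.savez_compressed` is also an option. A sketch (file name and toy arrays are assumptions):

```python
import numpy as np

# Toy stand-ins for the real arrays.
xtrain, ytrain = np.zeros((6, 4)), np.arange(6)
xtest, ytest = np.ones((2, 4)), np.arange(2)

# One .npz archive; arrays are keyed by the keyword-argument names.
np.savez_compressed("data.npz",
                    xtrain=xtrain, ytrain=ytrain,
                    xtest=xtest, ytest=ytest)

# Load and unpack by key.
with np.load("data.npz") as d:
    xtrain2, ytrain2 = d["xtrain"], d["ytrain"]
    xtest2, ytest2 = d["xtest"], d["ytest"]
```

The compression also speaks to the concern in the comment below about saving large datasets economically.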

petezurich
  • My training datasets are quite large in size. I want to save them more economically. – jwm Jun 09 '17 at 21:40