
I have some variables, which include dictionaries, lists of lists, and numpy arrays. I save all of them with the following code, where obj=[var1,var2,...,varn]. The variables are small enough to fit in memory.

My problem is that when I save the corresponding variables in MATLAB, the output file takes much less disk space than it does in Python. Similarly, loading the variables from disk back into memory takes much longer in Python than in MATLAB.

import pickle

with open(filename, 'wb') as output:
    pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

Thanks

user590028
    pickle is not an optimized disk format. It is meant to be a complete representation. If space is a big issue, you could either compress the pickled results or create your own file format. – user590028 Sep 07 '14 at 16:49
  • For quicker pickling you could always give [`cPickle`](https://docs.python.org/2/library/pickle.html#module-cPickle) a shot! – Alex Riley Sep 07 '14 at 17:37
  • `scipy.io.savemat` saves arrays (including sparse ones) in a MATLAB compatible format (4 & 5 versions). – hpaulj Sep 09 '14 at 02:52

3 Answers


Try this:

To save to disk

import gzip
import pickle

with gzip.open(filename + '.gz', 'wb') as gz:
    gz.write(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))

To load from disk

import gzip
import pickle

with gzip.open(filename + '.gz', 'rb') as gz:
    obj = pickle.loads(gz.read())
user590028
  • What about the loading? Should it be like lists=[] gz = gzip.open(filename + '.gz', 'rb') lists.append(gz.unzip(pickle.load(filename))) – Elrond Gimli Sep 07 '14 at 17:01

MATLAB uses HDF5 and compression to save .mat files; HDF5 is a format designed for fast access to large amounts of data. Python's pickle saves the information needed to recreate the objects; it is optimized for flexibility, not for speed or size. If you like, use HDF5 from Python.
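A minimal sketch with h5py (the file name `vars.h5`, the dataset name, and the example array here are all placeholders for your own data):

```python
import h5py
import numpy as np

var1 = np.arange(100.0)  # example array standing in for your data

# gzip-compress the dataset to shrink the file on disk
with h5py.File('vars.h5', 'w') as f:
    f.create_dataset('var1', data=var1, compression='gzip')

# read it back
with h5py.File('vars.h5', 'r') as f:
    loaded = f['var1'][:]
```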

Daniel
  • So, what is the best way to save/load the above obj using HDF5? f = h5py.File(filename, "w",chunks=True) f['var1']=obj[0], ..., f.close() – Elrond Gimli Sep 07 '14 at 16:54

Well, the issue is with pickle, not Python per se. As others have mentioned, .mat files saved in version 7.3 or higher use the HDF5 format. HDF5 is optimized to efficiently store and retrieve large datasets; pickle handles data differently. You can replicate or even surpass the performance of MATLAB's save function by using the h5py or netCDF4 Python modules; netCDF-4 files are themselves stored as HDF5. For example, using HDF5, you may do:

import h5py
import numpy as np

f = h5py.File('test.hdf5','w')
a = np.arange(10)
dset = f.create_dataset("init", data=a)
f.close()

I'm not sure if doing the equivalent in MATLAB will result in a file of exactly the same size, but it should be close. You can play around with HDF5's compression features to get the results you want.

Edit 1:

To load an HDF5 file, such as a .mat file, you could do something like `M2 = h5py.File('file.mat', 'r')`. `M2` is an HDF5 group, which behaves much like a Python dictionary. `M2.keys()` gives you the variable names. If one of the variables is an array called "data", you can read it out by doing `data = M2["data"][:]`.
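Put together, that reading pattern looks like the sketch below (the file and the variable name `data` are created here just so the example is self-contained; with a real v7.3 .mat file you would only need the second `with` block):

```python
import h5py
import numpy as np

# create a sample HDF5 file standing in for a v7.3 .mat file
with h5py.File('file.mat', 'w') as f:
    f.create_dataset('data', data=np.arange(5.0))

# open read-only, list the stored variables, and read one array into memory
with h5py.File('file.mat', 'r') as M2:
    names = list(M2.keys())   # variable names stored in the file
    data = M2['data'][:]      # read the 'data' array into memory
```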

Edit 2:

To save multiple variables, you can create multiple datasets. The basic syntax is `f.create_dataset("variable_name", data=variable)`. See the h5py documentation for more options. For example:

import h5py
import numpy as np

f = h5py.File('test.hdf5','w')

data1 = np.ones((4,4))
data2 = 2*data1
f.create_dataset("ones", data=data1)
f.create_dataset("twos", data=data2)

`f` is both a file object and an HDF5 group, so doing `f.keys()` gives:

[u'ones', u'twos']

To view what's stored under the 'ones' key, you would do:

f['ones'][:]

array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

You can create as many datasets as you would like. When you're done writing files, close the file object: `f.close()`.

I should add that my approach here only works for array-like datasets. You can save other Python objects, such as lists and dictionaries, but doing so requires a bit more work. I only resort to HDF5 for large numpy arrays. For everything else, pickle works just fine for me.
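For instance, a dictionary of arrays can be written with a simple loop over its items, one dataset per key (a sketch; the dictionary and the file name `vars.hdf5` are made up for illustration):

```python
import h5py
import numpy as np

# hypothetical data: keys become dataset names
variables = {'var1': np.zeros(3), 'var2': np.ones((2, 2))}

with h5py.File('vars.hdf5', 'w') as f:
    for name, arr in variables.items():
        f.create_dataset(name, data=arr)

# read everything back into a dictionary of arrays
with h5py.File('vars.hdf5', 'r') as f:
    restored = {name: f[name][:] for name in f.keys()}
```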

nikobam
  • How could I save data=[var1,var2,...,varn] using HDF5? Moreover, how could I load them via HDF5? – Elrond Gimli Sep 09 '14 at 05:21
  • There are number of ways to save data using HDF5. The basic way is to create a [HDF5 dataset](http://docs.h5py.org/en/latest/high/dataset.html#hdf5-datasets) – nikobam Sep 09 '14 at 20:36
  • To expand on my previous comment.... To load an HDF5 file, such as .mat file, you could do something like `M2 = h5py.File('file.mat')`. M2 is a HDF5 group, which is kinda like a python dictionary. Doing `M2.keys()` gives you the variable names. If one of the variables is an array called "data", you can read it out by doing `data = M2["data][:].` – nikobam Sep 09 '14 at 20:43
  • Thanks! I think the above should be data = M2['data'][:]? – Elrond Gimli Sep 09 '14 at 22:27
  • So, how do we save multiple variables like data1, data2,... in this case? – Elrond Gimli Sep 09 '14 at 22:28
  • Yup that should have been `data = M2['data'][:]`. Sorry about that. See the edits to my original post in response to your questions. – nikobam Sep 09 '14 at 23:15