
For a few aspects of a project, using "h5" storage would be ideal. However, the files are becoming massive and frankly we're running out of space.

This statement...

 store.put(storekey, data, table=False, compression='gzip')

does not produce any difference in file size compared to...

 store.put(storekey, data, table=False)

Is using compression even possible when going through Pandas?

... if it isn't possible, I don't mind using h5py; however, I'm uncertain what to use for a "datatype", as the DataFrame contains all sorts of types (strings, floats, ints, etc.)

Any help/insight would be appreciated!

TravisVOX

3 Answers


See the docs regarding compression with HDFStore.

gzip is not a valid compression option (and is silently ignored; that's a bug). Try any of zlib, bzip2, lzo, or blosc (bzip2/lzo may need extra libraries installed).

See the PyTables docs for more on the various compression options.

Here's a semi-related question.
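
For example, here is a minimal sketch of what that looks like in practice (the file name, key, and sample frame are placeholders):

import numpy as np
import pandas as pd

# a toy frame, just for illustration
df = pd.DataFrame(np.random.randn(1000, 4), columns=list('abcd'))

# open the store with one of the supported libraries; 'gzip' is not among them
with pd.HDFStore('compressed.h5', complevel=9, complib='zlib') as store:
    # format='table' (the modern spelling of table=True) stores the frame
    # as a PyTables Table, which is what the compression applies to
    store.put('df', df, format='table')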

Jeff
  • When I attempt to implement the code via the docs (table=True/False... all combinations), I get the following error: `ValueError: Compression not supported on non-table`. Is my DataFrame (which does contain some strings) not compatible with this type of storage? – TravisVOX Aug 16 '13 at 14:58
  • 1
    try opening the store with the ``complib='zlib',complevel=9``, the first time your write it; Tables support per table compression, but ``storers`` (a non-table) do not (because of their implementation, they don't use a compression format under the hood) – Jeff Aug 16 '13 at 15:02
  • As an aside, if you do have lots of data, the ``table`` format is probably better for you, as you can ``append``, e.g. do chunked reads and writes (and queries); a ``storer`` cannot (a sketch of this pattern follows these comments) – Jeff Aug 16 '13 at 15:07
  • Okay, I've followed the advice and am running into this error: `TypeError: Cannot serialize the column [name] because its data contents are [unicode] object dtype` – TravisVOX Aug 16 '13 at 16:38
  • what's your python version and tables version? – Jeff Aug 16 '13 at 16:52
  • I'm using Anaconda, fresh install as of last week. Should be pytables 2.4.0 and python 2.7.5. – TravisVOX Aug 16 '13 at 16:55
  • You can't serialize unicode using a ``table`` in py2, nor with PyTables 2.4.0; this would work in py3 with PyTables 3.0.0. A ``storer`` might work, but I'm not sure. Do you actually have ``unicode``, e.g. characters that require it? – Jeff Aug 16 '13 at 16:59
  • py2 doesn't support writing unicode directly in a ``table``; the primitive is not supported. A ``storer`` can handle it, but it is pickled, so it would not be very efficient. If you REALLY need actual unicode support (and not just string support), then you should use PyTables 3.0.0 and py3 – Jeff Aug 16 '13 at 17:11
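
As a rough sketch of the chunked append-and-query pattern described in the comments above (all names, sizes, and the chunk source are made up):

import numpy as np
import pandas as pd

# stand-in for chunks of data arriving from elsewhere
chunks = (pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
          for _ in range(5))

with pd.HDFStore('chunked.h5', complevel=9, complib='blosc') as store:
    for chunk in chunks:
        # append requires the table format; a fixed-format storer cannot append
        store.append('df', chunk)

# read back a slice without loading the whole table
with pd.HDFStore('chunked.h5') as store:
    subset = store.select('df', start=0, stop=100)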

I've been quite a fan of HDF5 in the past, but having hit a variety of complications, especially with Pandas HDFStore, I'm starting to think Exdir is a good idea.

http://exdir.readthedocs.io


You can write your data in a compressed format like this:

import pandas as pd

some_key = 'some_key'

# complevel/complib choose the compression level and library for the store
with pd.HDFStore('path/to/your/h5/file.h5', complevel=9, complib='zlib') as store:
    store[some_key] = your_data_to_save_in_the_key

And you can read it back:

# the compression settings are recorded in the file itself, so repeating
# them here is harmless but not required for reading
with pd.HDFStore('path/to/your/h5/file.h5', complevel=9, complib='zlib') as store:
    data_retrieved = store[some_key]
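
Equivalently, the one-shot DataFrame.to_hdf accepts the same compression arguments (a sketch; the frame and path are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'c'])

# to_hdf passes complevel/complib through to the underlying HDFStore
df.to_hdf('compressed.h5', key='df', complevel=9, complib='blosc')

# and read it back
df2 = pd.read_hdf('compressed.h5', 'df')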