Pandas - retrieving HDF5 columns and memory usage

Question

I have a simple question, and I cannot help but feel like I am missing something obvious.

I have read data from a source table (SQL Server) and have created an HDF5 file to store the data via the following:

output.to_hdf('h5name', 'df', format='table', data_columns=True, append=True, complib='blosc', min_itemsize=10)

The dataset is ~50 million rows and 11 columns.

If I read the entire HDF5 back into a dataframe (through HDFStore.select or read_hdf), it consumes about 24GB of RAM. If I pass specific columns to the read call (e.g. selecting 2 or 3 columns), the returned dataframe contains only those columns, but the same amount of memory is consumed (24GB).

This is running on Python 2.7 with Pandas 0.14.

Am I missing something obvious?

EDIT: I think I answered my own question. While I did a ton of searching before posting, obviously once posted I found a useful link: https://github.com/pydata/pandas/issues/6379

Any suggestions on how to optimize this process would be great. Because of memory limitations, I cannot afford the peak memory that is needed before gc releases it.

"Python 2.4" you should definitely consider updating, this is not supported (or do you mean 3.4??). — Andy Hayden, Sep 18 '14 at 00:40

Jeff · Accepted Answer · 2014-09-18T13:16:52.160

HDFStore in table format is a row-oriented store. A query selects on the rows, but for each selected row you get every column. Selecting a subset of columns just does a reindex at the end, so the full-width rows are still read into memory first.

There are several ways to approach this:

- use a column store, like bcolz; this is currently not implemented by PyTables, so it would involve quite a bit of work
- chunk through the table, see here, and concat at the end; this will use constant memory
- store as a fixed format; this is a more efficient storage format, so it will use less memory (but it cannot be appended to)
- create your own column-store-like layout by storing to multiple sub-tables and using select_as_multiple, see here
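The chunking option can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: the helper name `read_columns_in_chunks`, the file name, and the small random demo frame are all made up for illustration, and the chunk size would need adjusting to the real table. Each chunk is still read at full width and then trimmed, so peak memory is about one full-width chunk plus the narrow pieces collected so far.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_columns_in_chunks(path, key, columns, chunksize=100000):
    """Read `columns` from a table-format HDF5 key, one chunk of rows at a time."""
    pieces = []
    with pd.HDFStore(path, mode="r") as store:
        # Passing chunksize makes select() return an iterator of DataFrames.
        for chunk in store.select(key, columns=columns, chunksize=chunksize):
            pieces.append(chunk)
    # Only the narrow, already-trimmed pieces are kept until the final concat.
    return pd.concat(pieces)


# Hypothetical demo data standing in for the real 50-million-row table.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_chunks.h5")
demo = pd.DataFrame(np.random.randn(1000, 4), columns=list("abcd"))
demo.to_hdf(demo_path, key="df", format="table")

subset = read_columns_in_chunks(demo_path, "df", ["a", "b"], chunksize=250)
```

If the per-chunk work can be reduced to an aggregate (a sum, a count, a filtered subset), doing that inside the loop instead of collecting every piece keeps memory flat regardless of table size.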

Which option you choose depends on the nature of your data access.
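For the multiple sub-tables option, a rough sketch using `append_to_multiple` and `select_as_multiple` might look like this. The table names `narrow` and `wide`, the column split, and the demo data are assumptions chosen for illustration; how to split the real 11 columns depends on which ones are usually read together.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data; columns a and b play the role of the frequently read ones.
multi_path = os.path.join(tempfile.mkdtemp(), "demo_multi.h5")
frame = pd.DataFrame(np.random.randn(1000, 4), columns=list("abcd"))

with pd.HDFStore(multi_path, mode="w") as store:
    # "narrow" gets columns a and b; None sends the remaining columns to "wide".
    # The selector table's columns are stored as data columns, so they are queryable.
    store.append_to_multiple(
        {"narrow": ["a", "b"], "wide": None}, frame, selector="narrow"
    )

with pd.HDFStore(multi_path, mode="r") as store:
    # Reads only the narrow table, so columns c and d are never loaded.
    narrow_only = store.select("narrow", where="a > 0")
    # The row filter is evaluated on the selector table, then applied to both tables.
    both = store.select_as_multiple(
        ["narrow", "wide"], where="a > 0", selector="narrow"
    )
```

Reading `narrow` on its own is what saves memory here; `select_as_multiple` is for the less common case where the full-width rows are needed for a filtered subset.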

Note: you may not want to have all of the columns as data_columns unless you are really going to select from them all (you can only query ON a data_column or an index). Using fewer data_columns will make storing and querying faster.
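As a small illustration of that note, the sketch below declares a single data column instead of `data_columns=True`. The file name, the column names, and the choice of `a` as the only queryable column are assumptions for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data with four float columns.
dc_path = os.path.join(tempfile.mkdtemp(), "demo_datacols.h5")
table = pd.DataFrame(np.random.randn(1000, 4), columns=list("abcd"))

# Only "a" is declared a data column, so only "a" (and the index) can be used in where=.
table.to_hdf(dc_path, key="df", format="table", data_columns=["a"])

# Filter rows on the data column; columns= still trims the result after the read.
hits = pd.read_hdf(dc_path, key="df", where="a > 0", columns=["a", "b"])
```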
