I have a simple question, and I cannot help feeling that I am missing something obvious.
I have read data from a source table (SQL Server) and have created an HDF5 file to store the data via the following:
output.to_hdf('h5name', 'df', format='table', data_columns=True, append=True, complib='blosc', min_itemsize=10)
The dataset is ~50 million rows and 11 columns.
If I read the entire HDF5 file back into a dataframe (through HDFStore.select or read_hdf), it consumes about 24GB of RAM. If I pass specific columns to the read call (e.g. selecting 2 or 3 columns), the returned dataframe contains only those columns, but the same amount of memory (24GB) is still consumed.
This is running on Python 2.7 with Pandas 0.14.
Am I missing something obvious?
EDIT: I think I answered my own question. Although I did a ton of searching before posting, I of course found a useful link right after posting: https://github.com/pydata/pandas/issues/6379
Any suggestions on how to optimize this process would be great. Because of memory limitations, I cannot afford the peak memory that the read requires before it is released via gc.
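One workaround I am considering is to read the table in row chunks and keep only the wanted columns from each chunk, so that the full-width data never has to be in memory all at once. Below is a minimal sketch of that idea, not a verified fix: the helper names `combine_chunks` and `read_columns_chunked` and the default chunk size are my own assumptions, and I have not confirmed how it behaves on pandas 0.14.

```python
import pandas as pd


def combine_chunks(chunks, columns):
    """Keep only `columns` from each chunk and concatenate the results.

    `chunks` is any iterable of DataFrames (hypothetical helper).
    """
    # Take a narrow copy of each chunk; the full-width chunk can then be
    # garbage collected before the next one is read.
    kept = [chunk[columns] for chunk in chunks]
    return pd.concat(kept)


def read_columns_chunked(path, key, columns, chunksize=500000):
    """Read `key` from the HDF5 file at `path` in row chunks (hypothetical helper).

    The chunk size is an arbitrary assumption; tune it to the available RAM.
    """
    store = pd.HDFStore(path, mode='r')
    try:
        # With chunksize set, select() returns an iterator of DataFrames
        # instead of loading every row in a single call.
        chunks = store.select(key, columns=columns, chunksize=chunksize)
        return combine_chunks(chunks, columns)
    finally:
        store.close()
```

With the file from above, the call would look like `read_columns_chunked('h5name', 'df', ['col_a', 'col_b'])`, where the column names are placeholders. The final dataframe still has to fit in memory, but the peak should be closer to one full-width chunk plus the selected columns rather than the whole table.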