On disk indexing of Pandas multiindexed HDFStore

Question

In order to improve performance and reduce memory footprint, I am trying to read a multi-indexed HDFStore created in Pandas. The original Store is quite large, but the problem can be reproduced with a similar but smaller example.

df = pd.DataFrame([0.25, 0.5, 0.75, 1.0],
                      index=['Item0', 'Item1', 'Item2', 'Item3'], columns=['Values'])

df = pd.concat((df.iloc[:],df.iloc[:]), axis=0,names=['Item','N'], 
               keys = ['Items0','Items1'])

df.to_hdf('hdfs.h5', 'df', format='table',mode='w',complevel= 9,complib='blosc',data_columns=True) 

store = pd.HDFStore('hdfs.h5', mode= 'r')

store.select('df',where='Item="Items0"')

This is expected to be return the values of the sub-index, however it returns an error

> ValueError: The passed where expression: Item="Items0"
>             contains an invalid variable reference
>             all of the variable refrences must be a reference to
>             an axis (e.g. 'index' or 'columns'), or a data_column
>             The currently defined references are: index,iron,columns

The indices are:

store['df'].index

> MultiIndex(levels=[['Items0', 'Items1'], ['Item0', 'Item1', 'Item2',
> 'Item3']],
>            labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]],
>            names=['Item', 'N'])

Could any one just explain what may be the cause? or how it should be done properly...

score 0 · Accepted Answer · answered Nov 09 '16 at 09:57

0

For me works if remove data_columns=True:

df.to_hdf('hdfs3.h5', 'df', format='table',mode='w',complevel= 9,complib='blosc') 
store = pd.HDFStore('hdfs3.h5', mode= 'r')
print (store.select('df','Item="Items0"'))
              Values
Item   N            
Items0 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00

answered Nov 09 '16 at 09:57

jezrael

822,522
95
1,334
1,252

Thanks for your reply.. however its strange, pandas documentation and a lot of references says you have to use 'data_columns=True' – Suraj Nov 09 '16 at 10:13
Yes, I try find it too, but no success. In [docs](http://pandas.pydata.org/pandas-docs/stable/io.html#querying-a-table) is never use with `MultiIndex`, so maybe it is not supported. But I dont know. – jezrael Nov 09 '16 at 10:20
pandas official documentation and a lot of references says you have to use 'data_columns=True', eg [stackoverflow](http://stackoverflow.com/questions/20598973/hdfstore-term-memory-efficient-way-to-check-for-membership-in-list). What may be the difference ?. Thanks anyway.. – Suraj Nov 09 '16 at 10:29
please be aware that all columns beside index names (`Item`, `N`) will NOT be indexed this way and can't be used in the `where` clause. For example this query will not work: `store.select('df',where='N="Item3" and Values in [0.5, 1]')` – MaxU - stand with Ukraine Nov 09 '16 at 11:54
A follow up - anybody looking on this issue please take a look at [this one](http://stackoverflow.com/q/29497694/7135044), especially [answer](http://stackoverflow.com/a/29510954/7135044), hope it help attempts in this regards – Suraj Nov 10 '16 at 11:57

MaxU - stand with Ukraine · Answer 2 · 2016-11-10T06:34:39.883

Try to replace data_columns=True with data_columns=df.columns.tolist().

Demo:

Original MultiIndex DF:

In [2]: df
Out[2]:
              Values
Item   N
Items0 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00
Items1 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00

save it to HDF5 using data_columns=df.columns.tolist():

In [3]: df.to_hdf('c:/temp/hdfs.h5','df',format='t',mode='w',complevel=9,complib='blosc',data_columns=df.columns.tolist())

In [4]: df.columns.tolist()
Out[4]: ['Values']

selecting from HDF store:

In [5]: store = pd.HDFStore('c:/temp/hdfs.h5')

both index levels and Values column are indexed now and can be used in where=<query> argument:

In [6]: store.select('df',where='Item="Items0" and Values in [0.5, 1]')
Out[6]:
              Values
Item   N
Items0 Item1     0.5
       Item3     1.0

In [7]: store.select('df',where='N="Item3" and Values in [0.5, 1]')
Out[7]:
              Values
Item   N
Items0 Item3     1.0
Items1 Item3     1.0

storer information:

In [8]: store.get_storer('df').table
Out[8]:
/df/table (Table(8,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "N": StringCol(itemsize=5, shape=(), dflt=b'', pos=1),
  "Item": StringCol(itemsize=6, shape=(), dflt=b'', pos=2),
  "Values": Float64Col(shape=(), dflt=0.0, pos=3)}
  byteorder := 'little'
  chunkshape := (2427,)
  autoindex := True
  colindexes := {
    "Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "Item": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "N": Index(6, medium, shuffle, zlib(1)).is_csi=False}

storer index levels:

In [9]: store.get_storer('df').levels
Out[9]: ['Item', 'N']

NOTE: if you simply omit data_columns parameter, then only indexes will be indexed in the HDF store, all other columns will NOT be searchable:

Demo:

In [19]: df.to_hdf('c:/temp/NO_data_columns.h5', 'df', format='t',mode='w',complevel=9,complib='blosc')

In [20]: store = pd.HDFStore('c:/temp/NO_data_columns.h5')

In [21]: store.select('df',where='N == "Item3"')
Out[21]:
              Values
Item   N
Items0 Item3     1.0
Items1 Item3     1.0

In [22]: store.select('df',where='N == "Item3" and Values == 1')
---------------------------------------------------------------------------
...
skipped
...

ValueError: The passed where expression: N == "Item3" and Values == 1
            contains an invalid variable reference
            all of the variable refrences must be a reference to
            an axis (e.g. 'index' or 'columns'), or a data_column
            The currently defined references are: N,index,Item,columns

UPDATE:

What is the real difference in putting data_columns=df.columns.tolist() ?

In [18]: fn = r'd:/temp/a.h5'

In [19]: df.to_hdf(fn,'dc_true',data_columns=True,format='t',mode='w',complevel=9,complib='blosc')

In [20]: df.to_hdf(fn,'dc_cols',data_columns=df.columns.tolist(),format='t',complevel=9,complib='blosc')

In [21]: store = pd.HDFStore(fn)

In [22]: store
Out[22]:
<class 'pandas.io.pytables.HDFStore'>
File path: d:/temp/a.h5
/dc_cols            frame_table  (typ->appendable_multi,nrows->8,ncols->3,indexers->[index],dc->[N,Item,Values])
/dc_true            frame_table  (typ->appendable_multi,nrows->8,ncols->3,indexers->[index],dc->[Values])

In [23]: store.get_storer('dc_true').table.colindexes
Out[23]:
{
    "Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [24]: store.get_storer('dc_cols').table.colindexes
Out[24]:
{
    "Item": Index(6, medium, shuffle, zlib(1)).is_csi=False,  # <- missing when `data_columns=True`
    "N": Index(6, medium, shuffle, zlib(1)).is_csi=False,     # <- missing when `data_columns=True`
    "Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

So the difference is how the index columns will be indexed

Thank you for the detailed answer. What is the real difference in putting `data_columns=df.columns.tolist()` ?. I have tried this with pandas series table instead of pandas dataframe table, which works fine without modifying any parameters that i have posted. Any way thanks, it is so hard to find such information in the official documentation. — Suraj, Nov 10 '16 at 04:28
Great, now things are more clear. Thank you for your efforts. By the way I think it is really hard to do on disk indexing with multi-indexed hdfStores — Suraj, Nov 10 '16 at 12:06

On disk indexing of Pandas multiindexed HDFStore

2 Answers2