Pandas HDFStore of MultiIndex DataFrames: how to efficiently get all indexes

Question

In Pandas, is there a way to efficiently pull out all the MultiIndex indexes present in an HDFStore in table format?

I can select() efficiently using where=, but I want all indexes, and none of the columns. I can also select() using iterator=True to save RAM, but that still means reading pretty much all the table from disk, so it's still slow.

I have been hunting in the store.root..table.* stuff, hoping that I can get a list of index values. Am I on the right track?

Plan B would be to keep a shorter MultiIndex DataFrame that just contains empty DataFrames appended every time I append the main one. I can retrieve that and get the index much more cheaply than the main one. Inelegant though.

Jeff · Accepted Answer · 2013-07-15T15:00:26.967

Create a multi-index df

In [35]: df = DataFrame(randn(100000,3),columns=list('ABC'))

In [36]: df['one'] = 'foo'

In [37]: df['two'] = 'bar'

In [38]: df.ix[50000:,'two'] = 'bah'

In [40]: mi = df.set_index(['one','two'])

In [41]: mi
Out[41]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 3 columns):
A    100000  non-null values
B    100000  non-null values
C    100000  non-null values
dtypes: float64(3)

Store it as a table

In [42]: store = pd.HDFStore('test.h5',mode='w')

In [43]: store.append('df',mi)

get_storer will return the stored object (but not retrieve the data)

In [44]: store.get_storer('df').levels
Out[44]: ['one', 'two']

In [2]: store
Out[2]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable_multi,nrows->100000,ncols->5,indexers->[index],dc->[two,one])

The index levels are created as data_columns, meaning you can use them in selections This is how to select only the index

In [48]: store.select('df',columns=['one'])
Out[48]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Empty DataFrame

To select a single column and return it as a mi-frame

In [49]: store.select('df',columns=['A'])
Out[49]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 1 columns):
A    100000  non-null values
dtypes: float64(1)

To select a single column as a Series (which can also be the index as these are stored as columns). This will be quite fast.

In [2]: store.select_column('df','one')
Out[2]: 
0     foo
1     foo
2     foo
3     foo
4     foo
5     foo
6     foo
7     foo
8     foo
9     foo
10    foo
11    foo
12    foo
13    foo
14    foo
...
99985    foo
99986    foo
99987    foo
99988    foo
99989    foo
99990    foo
99991    foo
99992    foo
99993    foo
99994    foo
99995    foo
99996    foo
99997    foo
99998    foo
99999    foo
Length: 100000, dtype: object

If you really want the fastest selection of only the index

In [4]: %timeit store.select_column('df','one')
100 loops, best of 3: 8.71 ms per loop

In [5]: %timeit store.select('df',columns=['one'])
10 loops, best of 3: 43 ms per loop

Or to get a complete index

In [6]: def f():
   ...:     level_1 =  store.select_column('df','one')
   ...:     level_2 =  store.select_column('df','two')
   ...:     return MultiIndex.from_arrays([ level_1, level_2 ])
   ...: 

In [17]: %timeit f()
10 loops, best of 3: 28.1 ms per loop

If you want the values for each levels, a pretty fast way of doing it

In [2]: store.select_column('df','one').unique()
Out[2]: array(['foo'], dtype=object)

In [3]: store.select_column('df','two').unique()
Out[3]: array(['bar', 'bah'], dtype=object)

Fantastic answer, thanks Jeff. I guess the direct answer is that looking into the store.root.*.table is fruitless. On further consideration, inspired by your response, it makes sense that there is no shortcut - you have to let the select consider all the data. Trying to get some metadata to tell what index values is just going to take the same space. By way of explanation, not all possible combinations of my multiindex are filled, so just getting the levels of the multiindex are not enough. Thank you. — Tony, Jul 15 '13 at 14:39
you might find the ``select_column(...).unique()`` useful as well, which will in effect get you the values of a particular level, not sure if that is useful to you — Jeff, Jul 15 '13 at 14:42
@Jeff I also liked the `select_column(...).unique()` solution for its elegance, only (in my case at least) it fails for large tables (due to memory error). Please see my tentative solution for this case below - I would like to see a more elegant one (I do not like mine) — eldad-a, Jan 05 '14 at 22:43
you can pass start and stop to select_column; essentially you can iterate over ranges to figure out what you need then sect the actual rows, select_as_coordinates might be useful as well as it can take a full query and just return an index of the rows you want. — Jeff, Jan 05 '14 at 22:49

eldad-a · Answer 2 · 2014-01-05T22:51:45.860

Those working with even larger tables may find up the solution suggested by Jeff to end up with Memory Error. It is a much more elegant solution but I could not use in my case (for a 2e9 rows table, datetime index, on a 16GB RAM desktop). I ended up with the following (unfortunately not elegant) solution, where h5store is the HDFStore object, a multi-indexed DataFrame, saved as a table, with timestamp index (Float64) which is a CSI one:

%%time
#ts = h5store.select_column(h5store.keys()[0], column='timestamp').unique()

chunkshape = int(1e7) # can vary due to machine and hdf5

## get a list of chunks unique timestamps
ts = [indx.index.get_level_values('timestamp').unique() 
          for indx in h5full.select(h5full.keys()[0], columns=['timestamp'],
                                    stop=None, # change for a smaller selection
                                    chunksize=chunkshape)
      ]
## drop duplicates at the the end-points
for i in range(len(ts)-1):
    if ts[i][-1]==ts[i+1][0]:
         ts[i] = ts[i][:-1]
## merge to single ndarray
ts = np.concatenate(ts)

The time for this run (over 2e9 rows):

CPU times: user 14min 16s, sys: 2min 34s, total: 16min 50s
Wall time: 14min 45s

Pandas HDFStore of MultiIndex DataFrames: how to efficiently get all indexes

2 Answers2