Given a simple DataFrame with an integer index and a float column, this code:

import pandas as pd

store = pd.HDFStore('test.hdf5')
print(store.select('df', where='index >= 50000')['A'].mean())

is at least 10 times slower than this code:

store = pd.HDFStore('test.hdf5')
print(store.get('df')['A'][50000:].mean())

Whether the store uses the table or the fixed format does not make a huge difference: the select() call, even though it is equivalent to slicing, is much slower.

Thanks for any insights!

flyingmig

1 Answer

You cannot do a selection with where if the format is 'fixed'; that raises an exception. (Access times for a fixed-format store are actually much faster.) That said, you can index a fixed-format store directly.
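To make the fixed-format restriction concrete, here is a minimal sketch. It assumes PyTables is installed, writes to a temporary file, and uses a small made-up frame; the exception type is caught broadly because the exact class may differ between pandas versions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, not the one from the question).
df = pd.DataFrame({"A": np.arange(10, dtype="float64")})

path = os.path.join(tempfile.mkdtemp(), "fixed_sketch.h5")
df.to_hdf(path, key="df", mode="w", format="fixed")

# A `where` query against a fixed-format store is rejected.
try:
    pd.read_hdf(path, "df", where="index >= 5")
    where_failed = False
except (TypeError, ValueError):
    where_failed = True

# Reading the whole object and slicing in memory still works.
whole = pd.read_hdf(path, "df")
tail = whole["A"].iloc[5:]
```

With a fixed store the filtering therefore happens in memory, after the full read.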

In [39]: df = pd.DataFrame(np.random.randn(1000000,10))

In [40]: df.to_hdf('test.h5','df',mode='w',format='table')

In [41]: def f():
    df = pd.read_hdf('test.h5','df')
    return df.loc[50001:,0]
   ....: 

In [42]: def g():
    df = pd.read_hdf('test.h5','df')
    return df.loc[df.index>50000,0]
   ....: 

In [43]: def h():
    return pd.read_hdf('test.h5','df',where='index>50000')[0]
   ....: 

In [44]: f().equals(g())
Out[44]: True

In [46]: f().equals(h())
Out[46]: True

In [47]: %timeit f()
10 loops, best of 3: 159 ms per loop

In [48]: %timeit g()
10 loops, best of 3: 127 ms per loop

In [49]: %timeit h()
1 loops, best of 3: 499 ms per loop

Sure, it is slower, but you are doing a lot more work. This compares evaluating a boolean indexer against reading the entire array. Reading in the entire frame has quite a number of advantages (e.g. caching, locality).

Of course, if you are just selecting a contiguous slice, then just do:

In [59]: def i():
    return pd.read_hdf('test.h5','df',start=50001)[0]
   ....: 

In [60]: i().equals(h())
Out[60]: True

In [61]: %timeit i()
10 loops, best of 3: 86.6 ms per loop
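One caveat worth sketching: start is a row position, while where filters on index labels. The two agree in the session above only because the index runs 0..N-1. The snippet below is an illustration under assumptions (PyTables installed, a temporary file, a made-up frame whose labels start at 100), not part of the original timings.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame whose index labels start at 100 rather than 0.
df = pd.DataFrame(
    {"A": np.arange(10, dtype="float64")},
    index=np.arange(100, 110),
)

path = os.path.join(tempfile.mkdtemp(), "table_sketch.h5")
df.to_hdf(path, key="df", mode="w", format="table")

# `where` filters on index labels: keeps labels 105..109.
by_label = pd.read_hdf(path, "df", where="index > 104")

# `start` skips rows by position: row 5 is the row labelled 105.
by_position = pd.read_hdf(path, "df", start=5)

same_rows = by_label.equals(by_position)
```

Here the label 105 sits at row position 5, so start=5 and where='index > 104' return the same rows; with an unsorted or non-contiguous index only the where form is a safe translation of a label condition.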
Jeff
  • Makes sense, thank you. But if my DataFrame is big enough that it does not fit in memory, I have no choice but to fall back to select on a table, right? Or does indexing on a fixed format remain an option? – flyingmig Jan 04 '15 at 20:17
  • You can index on a fixed format, but only by location; to do actual selection you need to use the table format. – Jeff Jan 04 '15 at 22:50
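For the larger-than-memory case raised in the comments, one possible approach is to let select on a table-format store yield chunks and accumulate the mean incrementally. This is a sketch under assumptions: PyTables is installed, the frame is a small stand-in, and the chunk size of 100 is arbitrary.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in frame; a real use case would already have a large table on disk.
df = pd.DataFrame({"A": np.arange(1000, dtype="float64")})

path = os.path.join(tempfile.mkdtemp(), "chunk_sketch.h5")
df.to_hdf(path, key="df", mode="w", format="table")

total = 0.0
count = 0
with pd.HDFStore(path, mode="r") as store:
    # With chunksize set, select returns an iterator of DataFrames,
    # so only one chunk of matching rows is held in memory at a time.
    for chunk in store.select(
        "df", where="index >= 500", columns=["A"], chunksize=100
    ):
        total += chunk["A"].sum()
        count += len(chunk)

chunked_mean = total / count
```

The running sum and count keep memory use bounded by the chunk size, at the cost of the slower where evaluation discussed in the answer.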