Please review the HDFStore docs here, and the cookbook recipes here.
PyTables stores data in a row-oriented format, so it behooves you to generally have long and not so wide tables. However, if you tend to query for and want the entire row, then the width does not present a problem.
On the other hand, if you are generally after only a small subset of columns, you will want to shard the table into multiple tables (possibly with the same indexing scheme), so you can use a 'master' table to run the query and then select 'columns' (the other tables) as needed. You can accomplish this via the append_to_multiple/select_as_multiple methods, for example (see the sketch below). Taken to the extreme, you could store each column in a separate group and give yourself a column-oriented table. However, this will substantially slow things down if you tend to select a lot of columns at once.
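Something like this, as a rough sketch (untested; the file name, the frame, and the column split are all made up for illustration):

import numpy as np
import pandas as pd

df_wide = pd.DataFrame(np.random.randn(8, 4), columns=list('ABCD'))

store = pd.HDFStore('sharded.h5', mode='w')

# split the columns across two tables: 'd1' holds A/B and acts as the
# 'master' (selector) table; 'd2' (mapped to None) picks up the rest
store.append_to_multiple({'d1': ['A', 'B'], 'd2': None},
                         df_wide, selector='d1', data_columns=['A'])

# the where clause runs against the selector table only; the matching
# rows are then pulled from both tables and glued back together
result = store.select_as_multiple(['d1', 'd2'], where='A>0',
                                  selector='d1')

store.close()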
Furthermore, you always want to have the queryable columns as indexes or data_columns, as these are what allow queries in the first place and are indexed.
So it comes down to the ratio of queries that select lots of columns vs single-column selections.
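As a quick illustration of data_columns (again a made-up frame; only 'A' is declared queryable here):

import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])

store = pd.HDFStore('dc_example.h5', mode='w')

# only 'A' (plus the index) is written out as a queryable data column
store.append('df2', df2, data_columns=['A'])

# this works because 'A' is a data column; the same query on 'B'
# would raise, since 'B' was not declared in data_columns
positives = store.select('df2', where='A>0')

store.close()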
For example:
In [5]: df = pd.DataFrame(np.random.randn(16, 2),
   ...:                   columns=['A', 'B'],
   ...:                   index=pd.MultiIndex.from_tuples(
   ...:                       [(i, j) for i in range(4)
   ...:                        for j in pd.date_range('20130101 00:00:00',
   ...:                                               periods=4, freq='10T')],
   ...:                       names=['id', 'date']))
In [6]: df
Out[6]:
                               A         B
id date
0  2013-01-01 00:00:00 -0.247945  0.954260
   2013-01-01 00:10:00  1.035678 -0.657710
   2013-01-01 00:20:00 -2.399376 -0.188057
   2013-01-01 00:30:00 -1.043764  0.510098
1  2013-01-01 00:00:00 -0.009998  0.239947
   2013-01-01 00:10:00  2.038563  0.640080
   2013-01-01 00:20:00  1.123922 -0.944170
   2013-01-01 00:30:00 -1.757766 -1.398392
2  2013-01-01 00:00:00 -1.053324 -1.015211
   2013-01-01 00:10:00  0.062408 -1.476484
   2013-01-01 00:20:00 -1.202875 -0.747429
   2013-01-01 00:30:00 -0.798126 -0.485392
3  2013-01-01 00:00:00  0.496098  0.700073
   2013-01-01 00:10:00 -0.042914  1.099115
   2013-01-01 00:20:00 -1.762597 -0.239100
   2013-01-01 00:30:00 -0.344125 -1.607524

[16 rows x 2 columns]
In 0.12, use table=True rather than format='table'.
In [7]: df.to_hdf('test.h5', 'df', mode='w', format='table')

In [8]: store = pd.HDFStore('test.h5')

In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable_multi,nrows->16,ncols->4,indexers->[index],dc->[date,id])
In [10]: store.select('df', where='id=0')
Out[10]:
                               A         B
id date
0  2013-01-01 00:00:00 -0.247945  0.954260
   2013-01-01 00:10:00  1.035678 -0.657710
   2013-01-01 00:20:00 -2.399376 -0.188057
   2013-01-01 00:30:00 -1.043764  0.510098

[4 rows x 2 columns]
This is 0.13 syntax; it is a bit more tricky in 0.12 (see the note after the output below):
In [18]: store.select('df', where='date>"20130101 00:10:00" & date<"20130101 00:30:00"')
Out[18]:
                               A         B
id date
0  2013-01-01 00:20:00 -2.399376 -0.188057
1  2013-01-01 00:20:00  1.123922 -0.944170
2  2013-01-01 00:20:00 -1.202875 -0.747429
3  2013-01-01 00:20:00 -1.762597 -0.239100

[4 rows x 2 columns]
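For reference, in 0.12 you would build the where out of Term objects rather than a query string; roughly like this (from memory, so treat the exact spelling as an approximation):

from pandas.io.pytables import Term

result = store.select('df', [Term('date', '>', '20130101 00:10:00'),
                             Term('date', '<', '20130101 00:30:00')])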
So, for example, to do a groupby on id, you can select all of the unique ids (using the select_column method), then iterate over them, running a query for each and performing your function on the result. This will be quite fast because these are indexed columns. Something like this:
In [24]: ids = store.select_column('df', 'id').unique()

In [25]: ids
Out[25]: array([0, 1, 2, 3])

In [27]: pd.concat([store.select('df', where='id={0}'.format(i)).sum()
   ...:            for i in ids], axis=1)
Out[27]:
          0         1         2         3
A -2.655407  1.394721 -2.991917 -1.653539
B  0.618590 -1.462535 -3.724516 -0.047436

[2 rows x 4 columns]

In [28]: store.close()
A multi-groupby is just a combination query, e.g. id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'.
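In other words, something like this (reopening the same test.h5 from above; id and date are the data columns there):

store = pd.HDFStore('test.h5')
subset = store.select(
    'df',
    where="id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'")
store.close()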
You might find this example here instructive as well.