
I have a pandas DataFrame that I have written to an HDF5 file. The data is indexed by Timestamps and looks like this:

In [5]: df
Out[5]:
                          Codes   Price  Size
Time
2015-04-27 01:31:08-04:00     T  111.75    23
2015-04-27 01:31:39-04:00     T  111.80    23
2015-04-27 01:31:39-04:00     T  113.00    35
2015-04-27 01:34:14-04:00     T  113.00    85
2015-04-27 01:55:15-04:00     T  113.50   203
...                         ...     ...   ...
2015-05-26 11:35:00-04:00    CA  110.55   196
2015-05-26 11:35:00-04:00    CA  110.55    98
2015-05-26 11:35:00-04:00    CA  110.55   738
2015-05-26 11:35:00-04:00    CA  110.55    19
2015-05-26 11:37:01-04:00        110.55    12

What I would like is a function that takes a pandas DatetimeIndex and returns a DataFrame containing, for each Timestamp in the DatetimeIndex, the row at or right before that Timestamp.

The problem I'm running into is that concatenated read_hdf queries won't work if I am looking for more than 30 rows -- see [pandas read_hdf with 'where' condition limitation?].

What I am doing now is this, but there has to be a better solution:

from pandas import read_hdf, DatetimeIndex
from datetime import timedelta
import pytz

def getRows(file, dataset, index):

    # Query the whole day span covered by the requested timestamps
    if len(index) == 1:
        start = index.date[0]
        end = (index.date + timedelta(days=1))[0]
    else:
        start = index.date.min()
        end = index.date.max() + timedelta(days=1)

    where = '(index >= "' + str(start) + '") & (index < "' + str(end) + '")'

    df = read_hdf(file, dataset, where=where)

    # Keep the last row per timestamp, then pad-match to the requested index
    df = df.groupby(level=0).last().reindex(index, method='pad')

    return df
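For example, I call it like this (the file name, dataset key and timestamps below are just placeholders, not my real data):

# hypothetical call -- 'ticks.h5' / 'df' stand in for my real file and dataset
idx = DatetimeIndex(['2015-04-27 01:32:00-04:00',
                     '2015-05-26 11:36:00-04:00'])
rows = getRows('ticks.h5', 'df', idx)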
baconwichsand

1 Answer


This is an example of using a where mask

In [22]: pd.set_option('max_rows',10)

In [23]: df = DataFrame({'A' : np.random.randn(100), 'B' : pd.date_range('20130101',periods=100)}).set_index('B')

In [24]: df
Out[24]: 
                   A
B                   
2013-01-01  0.493144
2013-01-02  0.421045
2013-01-03 -0.717824
2013-01-04  0.159865
2013-01-05 -0.485890
...              ...
2013-04-06 -0.805954
2013-04-07 -1.014333
2013-04-08  0.846877
2013-04-09 -1.646908
2013-04-10 -0.160927

[100 rows x 1 columns]

Store the test frame

In [25]: store = pd.HDFStore('test.h5',mode='w')

In [26]: store.append('df',df)

Create a random selection of dates.

In [27]: dates = df.index.take(np.random.randint(0,100,10))

In [28]: dates
Out[28]: DatetimeIndex(['2013-03-29', '2013-02-16', '2013-01-15', '2013-02-06', '2013-01-12', '2013-02-24', '2013-02-18', '2013-01-06', '2013-03-17', '2013-03-21'], dtype='datetime64[ns]', name=u'B', freq=None, tz=None)

Select the index column (in its entirety)

In [29]: c = store.select_column('df','index')

In [30]: c
Out[30]: 
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
4    2013-01-05
        ...    
95   2013-04-06
96   2013-04-07
97   2013-04-08
98   2013-04-09
99   2013-04-10
Name: B, dtype: datetime64[ns]

Select the indexers that you want. This could actually be somewhat complicated, e.g. you might want a .reindex(method='nearest')

In [34]: c[c.isin(dates)] 
Out[34]: 
5    2013-01-06
11   2013-01-12
14   2013-01-15
36   2013-02-06
46   2013-02-16
48   2013-02-18
54   2013-02-24
75   2013-03-17
79   2013-03-21
87   2013-03-29
Name: B, dtype: datetime64[ns]
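For the question's as-of lookup (the row at or right before each Timestamp), a possible sketch, assuming the stored index is sorted ascending, is to pad-match against the column pulled with select_column instead of using isin:

# sketch only: position of the last stored timestamp <= each query timestamp
pos = np.searchsorted(c.values, dates.values, side='right') - 1
pos = pos[pos >= 0]    # drop queries that fall before the first stored row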

Select the rows that you want

In [32]: store.select('df',where=c[c.isin(dates)].index)
Out[32]: 
                   A
B                   
2013-01-06  0.680930
2013-01-12  0.165923
2013-01-15 -0.517692
2013-02-06 -0.351020
2013-02-16  1.348973
2013-02-18  0.448890
2013-02-24 -1.078522
2013-03-17 -0.358597
2013-03-21 -0.482301
2013-03-29  0.343381

In [33]: store.close()
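Putting the pieces together, here is a sketch of a helper along the lines of the question's getRows; the function name is made up, and it assumes the stored index is sorted ascending and that the query index is comparable to (same tz as) the stored one:

import numpy as np
import pandas as pd

def get_rows(path, key, when):
    # Sketch: return the row at or right before each timestamp in `when`
    with pd.HDFStore(path, mode='r') as store:
        # pull ONLY the index column into memory
        c = store.select_column(key, 'index')
        # row number of the last stored timestamp <= each query timestamp
        pos = np.searchsorted(c.values, when.values, side='right') - 1
        pos = np.unique(pos[pos >= 0])
        # read just those rows from disk
        df = store.select(key, where=list(pos))
    # pad-match back onto the requested timestamps
    return df.reindex(when, method='pad')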
Jeff
  • Thanks -- the problem I ran into with this approach is that for very large datasets I run into memory problems when pulling the entire index column into memory. Therefore I'm looking for a way to slice by a list of indices on the HDF5 table directly, without having to pull anything into memory – baconwichsand May 27 '15 at 19:40
  • Really, you can't pull a single column into memory? Note that you have to use exactly what I did above, ``select_column``, NOT a regular select -- ``select_column`` ONLY pulls that single column into memory, while a regular select does a reindex and will use a lot of memory. How big is the dataset and how much memory? I assume you are using 64-bit. – Jeff May 27 '15 at 20:00
  • I was running into memory issues when pulling the whole dataset into memory; that's why I started the read_hdf query approach. The datasets I'm dealing with are about 10m rows, 3 columns, and I was running into memory errors when loading the sets from HDF5. Unfortunately I have to work with 32-bit under Windows. I will try your method!!!! Ideally I would just read the sets into memory once, as all the IO is massively slowing down my operation – baconwichsand May 28 '15 at 00:33
  • I have 8GB memory and will be upgrading to 24GB in the next few days; maybe that will allow me to load everything upfront – baconwichsand May 28 '15 at 00:34
  • 32-bit on Windows has a 2GB (at most) limit; you should simply install 64-bit Python – Jeff May 28 '15 at 00:41