5

I have created an HDFStore. The HDFStore contains a group df which is a table with 2 columns. The first column is a string and second column is DateTime(which will be in sorted order). The Store has been created using the following method:

from numpy import ndarray
import random
import datetime
from pandas import DataFrame, HDFStore


def create(n):
    mylist = ['A' * 4, 'B' * 4, 'C' * 4, 'D' * 4]
    data = []
    for i in range(n):
        data.append((random.choice(mylist),
                     datetime.datetime.now() - datetime.timedelta(minutes=i)))

    data_np = ndarray(len(data), dtype=[
                      ('fac', 'U6'), ('ts', 'datetime64[us]')])
    data_np[:] = data
    df = DataFrame(data_np)
    return df


def create_patches(n, nn):
    for i in range(n):
        yield create(nn)


df = create_patches(100, 1000000)
store = HDFStore('check.hd5')
for each in df:
    store.append('df', each, index=False, data_columns=True, format = 'table')
store.close()

Once the HDF5 file is created, i'm querying the table using the following method:

In [1]: %timeit store.select('df', ['ts>Timestamp("2016-07-12 10:00:00")'])
1 loops, best of 3: 13.2 s per loop

So, basically this is taking 13.2 seconds, then I added an index to this column using

In [2]: store.create_table_index('df', columns=['ts'], kind='full')

And then I again did the same query, this time I got the following:-

In [3]: %timeit store.select('df', ['ts>Timestamp("2016-07-12 10:00:00")'])
1 loops, best of 3: 12 s per loop

From the above, it seems to me there isn't a significant improvement in the performance. So, my question is, what else can I do here to make my query faster, or is there something I'm doing wrong?

harmands
  • 1,082
  • 9
  • 24

1 Answers1

3

I think your columns has already been indexed when you specified data_columns=True...

See this demo:

In [39]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))

In [40]: fn = 'c:/temp/x.h5'

In [41]: store = pd.HDFStore(fn)

In [42]: store.append('table_no_dc', df, format='table')

In [43]: store.append('table_dc', df, format='table', data_columns=True)

In [44]: store.append('table_dc_no_index', df, format='table', data_columns=True, index=False)

the data_columns wasn't specified, so only index is indexed:

In [45]: store.get_storer('table_no_dc').group.table
Out[45]:
/table_no_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int32Col(shape=(3,), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

data_columns=True - all data columns have been indexed:

In [46]: store.get_storer('table_dc').group.table
Out[46]:
/table_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Int32Col(shape=(), dflt=0, pos=1),
  "B": Int32Col(shape=(), dflt=0, pos=2),
  "C": Int32Col(shape=(), dflt=0, pos=3)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoindex := True
  colindexes := {
    "C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}

data_columns=True, index=False - we have data columns information, but no indexes for them:

In [47]: store.get_storer('table_dc_no_index').group.table
Out[47]:
/table_dc_no_index/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Int32Col(shape=(), dflt=0, pos=1),
  "B": Int32Col(shape=(), dflt=0, pos=2),
  "C": Int32Col(shape=(), dflt=0, pos=3)}
  byteorder := 'little'
  chunkshape := (3276,)

colindexes - shows the list of indexed columns in the examples above

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • so in the third case, were no indexes created? I think im getting confused here – harmands Jul 18 '16 at 16:50
  • ```/df/table (Table(100000000,)) '' description := { "index": Int64Col(shape=(), dflt=0, pos=0), "fac": StringCol(itemsize=4, shape=(), dflt=b'', pos=1), "ts": Int64Col(shape=(), dflt=0, pos=2)} byteorder := 'little' chunkshape := (6553,) autoindex := True colindexes := { "ts": Index(6, full, shuffle, zlib(1)).is_csi=True} ``` This is what im getting, basically I have only one index, on `ts` itself – harmands Jul 18 '16 at 16:56
  • `colindexes := { "ts": Index(6, full, shuffle, zlib(1)).is_csi=True}` - `ts` is indexed – MaxU - stand with Ukraine Jul 18 '16 at 16:59
  • thank you, one more thing, is this the correct way to store/query dates? Assuming that the dates are sorted. – harmands Jul 18 '16 at 17:02
  • @harman786, yes, you can also [compare your `ts` with variables](http://stackoverflow.com/a/38401560/5741205) if you want. I would also recommend to disable indexing, when adding data to HDFStore and then reindex everything in one step - it should be faster – MaxU - stand with Ukraine Jul 18 '16 at 17:06
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/117627/discussion-between-harman786-and-maxu). – harmands Jul 18 '16 at 17:09