
How can I efficiently find the last time stamp (from the Datetime column) for each unique value in the SecurityID column? There are roughly 1000 unique values in the SecurityID column.

Currently I query the whole table for each unique value in SecurityID, then I look up the last timestamp. As you can probably imagine, it's excruciatingly slow. The table is over 40 GB and growing.

I do something like this:

import os
import pandas as pd

os.chdir(r'E:\HDFStores')
store = pd.HDFStore('mysuperawesomehdfstore.h5')
assets = skBase.bbg_helper_assets('minutely')  # helper that returns the list of assets (SecurityIDs)
df_timestamp = pd.DataFrame()
tags = ['T', 'B', 'A']
for asset in assets:
    for tag in tags:
        print asset, " ", tag
        # pull only the Datetime column for this asset/tag, then keep the last row
        timestamp = store.select('table',
                                 where="SecurityID = ['" + asset + "'] & Tag = ['" + tag + "']",
                                 columns=['Datetime']).tail(1)
        if len(timestamp.index) == 0:
            print "value DNE"
        else:
            dt = str(timestamp.iloc[0][0]).split(' ', 1)[0]
            tm = str(timestamp.iloc[0][0]).split(' ', 1)[1]

I have thought about running separate Python processes on my 4-core machine, but I'd rather have a cleaner way of doing things than resorting to that hack.

Any ideas would be appreciated.


1 Answer


Because your database is huge, you have to query it from the hard disk, so you face an I/O bottleneck.

That's actually the main problem here. Smart code can't really compensate for having to query a 40 GB file, especially considering that your query is pretty simple. And multiprocessing will not help either (it's not a CPU bottleneck). So I think the solution goes through an update of your workflow.

All of the solutions rely either on an asynchronous operation (processing all the data first, dumping it to a separate file, reading from that file as needed), or on a reorganization of the way you store the data.

Asynchronous 1

If you are updating your main HDF5 file less often than daily, you could simply:

After you update the main HDF5 file, query the latest timestamp (with your existing code) for every SecurityID, and dump the result into a separate H5 file (just index: SecurityID, value: latest timestamp). Of course, this solution only works if each query takes less than 30 seconds or so (even at that rate, the full pass would already take almost half a day...).

You can then keep this data in memory (it should only be a few KB), and access it anytime you need the latest timestamp of a specific SecurityID.
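A minimal sketch of that dump-and-reload step, assuming the 'table' node and the SecurityID/Datetime columns from your code; the file name 'latest_timestamps.h5' and the 'latest' key are just placeholders:

import pandas as pd

def build_latest_timestamp_file(assets, store_path='mysuperawesomehdfstore.h5',
                                out_path='latest_timestamps.h5'):
    store = pd.HDFStore(store_path, mode='r')
    latest = {}
    for asset in assets:
        # same per-SecurityID query as in the question, Datetime column only
        ts = store.select('table',
                          where="SecurityID = ['" + asset + "']",
                          columns=['Datetime']).tail(1)
        if len(ts.index):
            latest[asset] = ts.iloc[0][0]
    store.close()

    # tiny lookup table: index = SecurityID, value = latest timestamp
    result = pd.Series(latest, name='Datetime').to_frame()
    result.to_hdf(out_path, 'latest', mode='w')
    return result

# later, load it once and keep it in memory:
# latest = pd.read_hdf('latest_timestamps.h5', 'latest')
# latest.loc[some_security_id, 'Datetime']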

Asynchronous 2

A smarter way (but more work) would be to record the latest timestamp when you receive the updated data. I don't know exactly how you are updating your HDF5 file, but I guess that you are downloading new data and appending it to the current file.

In that case, that would be the right moment to grab the latest timestamps. Your workflow would become (a sketch follows the list):

  1. Download new data
  2. From the new data, for each SecurityID, get the latest Timestamp
  3. Dump / Update your 'latest_timestamp' h5 file
  4. Update your main HDF5 with the new data
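A rough sketch of steps 2 and 3, assuming the freshly downloaded batch arrives as a DataFrame new_df with SecurityID and Datetime columns (the lookup file and key names are again placeholders):

import pandas as pd

def update_latest_timestamps(new_df, lookup_path='latest_timestamps.h5'):
    # step 2: latest Datetime per SecurityID in the new batch only
    new_latest = new_df.groupby('SecurityID')['Datetime'].max()

    # step 3: merge with whatever was recorded before, keep the max per SecurityID
    try:
        old_latest = pd.read_hdf(lookup_path, 'latest')['Datetime']
    except (IOError, KeyError):
        old_latest = pd.Series(name='Datetime')
    combined = pd.concat([old_latest, new_latest]).groupby(level=0).max()
    combined.to_frame('Datetime').to_hdf(lookup_path, 'latest', mode='w')

# step 4 is unchanged: append new_df to the main store afterwards, e.g.
# store.append('table', new_df, data_columns=['SecurityID', 'Tag'])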

H5 reorganization

If none of the above solutions seems workable, then you could store each SecurityID as a separate node in the h5 file. So you'd do something like:

store.select(asset, where="Tag = ['" + tag + "']", columns=['Datetime']).tail(1)

It's probably a bit more work to adapt the rest of your code (and some upfront work to reorganize your database), but this should reduce your query time very significantly, and it'll be helpful long term. I don't see any good reason to keep all the SecurityIDs bunched up in one giant node. And it's less hackish than the other solutions :-)
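For example, a one-off migration along these lines could split the existing monolithic 'table' node into per-SecurityID nodes. This is only a sketch: the 'per_security.h5' file name is made up, and the asset strings have to be usable as HDF5 node names.

import pandas as pd

old_store = pd.HDFStore('mysuperawesomehdfstore.h5', mode='r')
new_store = pd.HDFStore('per_security.h5', mode='w')

for asset in assets:  # same asset list as in the question
    df = old_store.select('table', where="SecurityID = ['" + asset + "']")
    if len(df.index):
        # one node per SecurityID; keep Tag and Datetime queryable
        new_store.append(asset, df, data_columns=['Tag', 'Datetime'])

old_store.close()
new_store.close()

# querying the last timestamp for one security then only touches its own node:
# new_store.select(asset, where="Tag = ['" + tag + "']", columns=['Datetime']).tail(1)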

knightofni
  • Yeah, will have to go with the H5 reorg. I wonder if there is a way to query and merge different tables when row numbers between different tables are not the same. With separate tables I will have to use pandas to do the merge **after** the H5 store has been queried. Common sense would say this is a fundamental feature to support natively in H5 but I suppose there are technical reasons why this is not available (to my knowledge). –  Jan 20 '15 at 22:27