
I have converted a large (6 GB) dataframe to an HDF5 file. It contains roughly 30 million transactions from 100k user IDs over 10 years. I am looking to recreate the state of each user ID at 48 different points in time, for example: "for each user ID, how many transactions did they have before 1 Jan 2012?" That is one summary field.
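To make that concrete, this is the kind of calculation I mean for a single cutoff date (the column names here are made up, not the real ones):

```python
import pandas as pd

# Minimal sketch of one summary field; 'user_id' and 'txn_date' are
# placeholder column names.
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'txn_date': pd.to_datetime(['2011-03-01', '2012-06-15',
                                '2010-01-20', '2011-11-11', '2013-02-02']),
})

cutoff = pd.Timestamp('2012-01-01')

# Number of transactions per user before the cutoff date.
counts = (df[df['txn_date'] < cutoff]
          .groupby('user_id')
          .size()
          .rename('n_txns_before_2012'))
print(counts)
```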

My current method is to query the HDF5 store for each user_id using the pandas HDF5 query functionality, then cycle through each date and output the summary at that point in time. The issue is that this takes too long per user ID, sometimes 10-20 seconds each, and the HDF5 query appears to take most of the time.
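The loop is roughly the following shape (a sketch only; the store key `'transactions'` and the columns `'user_id'` / `'txn_date'` are placeholders):

```python
import pandas as pd

# 48 cutoff dates covering the 10-year period (placeholder frequency).
cutoffs = pd.date_range('2006-01-01', periods=48, freq='QS')

with pd.HDFStore('transactions.h5', mode='r') as store:
    # One HDF5 query per user id -- this select is the slow part.
    user_txns = store.select('transactions', where='user_id == 42')
    # The 48 summaries are then cheap, in-memory operations.
    summary = {cutoff: int((user_txns['txn_date'] < cutoff).sum())
               for cutoff in cutoffs}
```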

Is there a way I can create an index on the HDF5 file to speed up the query, or would sorting it by user_id help? All of the data manipulation is done in pandas, but I am unable to distribute the work across multiple machines due to environment constraints.
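For example, is something like the following the right direction? This is only a sketch based on my reading of the pandas docs; the tiny frame, store key and column names are stand-ins, and I haven't verified that it helps.

```python
import pandas as pd

# Tiny stand-in frame; in reality this would be the full transactions frame.
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'txn_date': pd.to_datetime(['2011-03-01', '2012-06-15', '2010-01-20']),
})

# Write in table format with user_id and the date column declared as
# data_columns, so PyTables builds queryable indexes on them.
df.to_hdf('transactions_indexed.h5', key='transactions',
          format='table', data_columns=['user_id', 'txn_date'])

# Optionally rebuild the user_id index as a full (completely sorted) index.
with pd.HDFStore('transactions_indexed.h5') as store:
    store.create_table_index('transactions', columns=['user_id'],
                             optlevel=9, kind='full')
```

If physically reordering the rows by user_id would also help, my understanding is that PyTables' `ptrepack` utility has a `--sortby` option for that, but it needs a full index on the sort column first.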

user1253493
  • Take a look at [indexing](http://pandas.pydata.org/pandas-docs/stable/io.html#indexing) in the [HDF5 docs](http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables), and the ***[cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore)***. In particular, see [this](http://stackoverflow.com/questions/15798209/pandas-group-by-query-on-large-data-in-hdfstore) answer and [this](http://stackoverflow.com/questions/25459982/trouble-with-grouby-on-millions-of-keys-on-a-chunked-file-in-python-pandas/25471765#25471765) one. – Kartik Sep 22 '16 at 05:06
  • What is your chunking scheme in the HDF5 file? Try to order it so that you read a minimal number of chunks each time. – kakk11 Sep 22 '16 at 15:33
  • I don't know what the chunking scheme is. Initially I had the data in month-by-month form, then I wrote a loop to append it all to one file (roughly the shape sketched below). – user1253493 Sep 23 '16 at 09:38
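The append loop was roughly this (a sketch; the file pattern, input format, store key and column names are placeholders):

```python
import glob
import pandas as pd

# Placeholder sketch of the month-by-month append loop. Declaring the same
# data_columns on every append keeps user_id and the date column queryable
# in the combined table.
with pd.HDFStore('transactions.h5', mode='w') as store:
    for path in sorted(glob.glob('monthly/*.csv')):
        monthly = pd.read_csv(path, parse_dates=['txn_date'])
        store.append('transactions', monthly,
                     data_columns=['user_id', 'txn_date'])
```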

0 Answers