I have converted a large (6 GB) dataframe to an HDF5 file. It contains roughly 30 million transactions from 100k user IDs over 10 years. I am looking to recreate the state of each user ID at 48 different points in time, e.g. "for each user ID, how many transactions did they have before 1 Jan 2012?". That is one summary field.
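To make that concrete, here is a minimal sketch of one such summary field (the column names `user_id` and `trans_date` and the exact snapshot dates are just illustrative):

```python
import pandas as pd

# 48 snapshot dates spread across the 10-year window (illustrative)
snapshots = pd.date_range("2006-01-01", "2015-12-31", periods=48)

def transactions_before(df: pd.DataFrame, cutoff: pd.Timestamp) -> pd.Series:
    """Count each user's transactions strictly before `cutoff`.

    `df` holds one row per transaction, with columns user_id and trans_date.
    """
    return (
        df.loc[df["trans_date"] < cutoff]
          .groupby("user_id")
          .size()
          .rename(f"n_trans_before_{cutoff.date()}")
    )
```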
My current method is to query the HDF5 store for each user_id using the pandas HDF5 query functionality (a where clause per user), then cycle through each date and output the summary at that point in time. The issue is that this takes too long per user ID, sometimes 10-20 seconds each, and the HDF5 query appears to take most of that time.
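In outline, the loop looks roughly like this (the store path and key `"transactions"` are from my setup; user_id is already a data column, otherwise the where clause would not work):

```python
import pandas as pd

snapshots = pd.date_range("2006-01-01", "2015-12-31", periods=48)

with pd.HDFStore("transactions.h5", mode="r") as store:
    # Pull the list of users once from the stored table
    user_ids = store.select_column("transactions", "user_id").unique()

    for uid in user_ids:
        # One HDF5 where-query per user (this is the step taking 10-20 s)
        trans = store.select("transactions", where="user_id == uid")
        dates = trans["trans_date"].sort_values().to_numpy()
        for cutoff in snapshots:
            # Number of transactions strictly before this point in time
            n_before = dates.searchsorted(cutoff.to_datetime64())
```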
Is there a way to create an index on the HDF5 file to speed up the query, or would sorting it by user_id help? All of the data manipulation is in pandas, but I am unable to distribute the work across multiple machines due to environment constraints.
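For reference, this is what I understand the indexed/sorted rewrite would look like, based on the pandas docs for `data_columns` and `HDFStore.create_table_index` (the sort order and the `optlevel`/`kind` choices are guesses on my part):

```python
import pandas as pd

# Rewrite the store in table format with user_id as a data column,
# sorted by user_id so each user's rows are contiguous on disk.
df = pd.read_hdf("transactions.h5", "transactions")  # fits in memory in my case
df.sort_values(["user_id", "trans_date"]).to_hdf(
    "transactions_sorted.h5",
    key="transactions",
    format="table",
    data_columns=["user_id", "trans_date"],
)

with pd.HDFStore("transactions_sorted.h5") as store:
    # Upgrade the default index on user_id to a full PyTables index
    store.create_table_index("transactions", columns=["user_id"],
                             optlevel=9, kind="full")
```

If that is the wrong approach, pointers on the right way to index or sort the store would be appreciated.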