
Let's say I have a pandas DataFrame

import numpy as np
import pandas as pd

# sample DataFrame with two columns of random floats
df = pd.DataFrame(np.random.randn(100, 2), columns=['Column1', 'Column2'])

df

   Column1    Column2
0  0.189086 -0.093137
1  0.621479  1.551653
2  1.631438 -1.635403
3  0.473935  1.941249
4  1.904851 -0.195161
5  0.236945 -0.288274
6 -0.473348  0.403882
7  0.953940  1.718043
8 -0.289416  0.790983
9 -0.884789 -1.584088
........

An example of a query is df.query('Column1 > Column2')

Let's say you wanted to limit the size of the result of this query, so the returned object isn't so large. Is there a "pandas" way to accomplish this?

My question is primarily about querying an HDF5 object with pandas. An HDF5 file could be far larger than RAM, and therefore query results could be larger than RAM.

# file1.h5 contains only one table/key/HDF5 group called 'df'
store = pd.HDFStore('file1.h5')

# the result of the following query could be too large
df = store.select('df', columns=['column1', 'column2'], where=['column1 == 5'])

Is there a pandas/Pythonic way to stop users from executing queries that surpass a certain size?

ShanZhengYang

1 Answer


Here is a small demonstration of how to use the chunksize parameter when calling HDFStore.select():

for chunk in store.select('df', columns=['column1', 'column2'],
                          where='column1 == 5', chunksize=10**6):
    # process each `chunk` DataFrame (at most 10**6 rows in memory at a time)
    ...
MaxU - stand with Ukraine
  • This doesn't quite answer my question, but this is an approach. If I'm integrating PyTables into software whereby users query an `HDFStore`, I would like the query to proceed until it hits "too many rows"---then, it will stop and throw an error. The above is a solution if I know a priori that the query is too large, and I want to break it up. Am I explaining the problem clearly? – ShanZhengYang Oct 11 '16 at 21:22
  • @ShanZhengYang, no it's still not quite clear to me... Do you want to estimate the size of the resulting DF before reading it from the store? – MaxU - stand with Ukraine Oct 11 '16 at 21:28
  • Not necessarily, but I suspect that is the best way to do it. Let's say I try `df = store.select('df', columns=['column1', 'column2'], where=['column1==5'])` and it's larger than some limit in terms of RAM---if the limit is the limit set by the computer's hardware, the program will just freeze. Let's say I wanted to set an arbitrary limit, e.g. 4 GB. The HDF5 might be +TB or PB, so `df` could easily exceed RAM if a user were to query this object. What limitations could I put in place to stop "bad things" from happening? – ShanZhengYang Oct 11 '16 at 21:50
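
Following up on the comments: one possible way to enforce such a cap is to accumulate chunks until an arbitrary row limit is reached and abort with an error once it is exceeded. This is only a rough sketch; the cap `MAX_ROWS` is an assumed application-defined value, and the file/column names are carried over from the question.

import pandas as pd

MAX_ROWS = 10**7   # arbitrary application-defined cap (assumption)
CHUNK = 10**6      # rows fetched per iteration

store = pd.HDFStore('file1.h5')
pieces, total = [], 0
try:
    for chunk in store.select('df', columns=['column1', 'column2'],
                              where='column1 == 5', chunksize=CHUNK):
        total += len(chunk)
        if total > MAX_ROWS:
            # stop before the accumulated result can exhaust RAM
            raise MemoryError('query result exceeds the configured row limit')
        pieces.append(chunk)
    # assemble the result only if it stayed under the cap
    df = (pd.concat(pieces, ignore_index=True)
          if pieces else pd.DataFrame(columns=['column1', 'column2']))
finally:
    store.close()

A byte-based limit (e.g. the 4 GB mentioned above) could be approximated by summing `chunk.memory_usage(deep=True).sum()` instead of `len(chunk)`.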