
Let's say I have a pandas DataFrame

import numpy as np
import pandas as pd

# sample DataFrame with two columns of random floats
df = pd.DataFrame(np.random.randn(100, 2), columns=['Column1', 'Column2'])

df

   Column1    Column2
0  0.189086 -0.093137
1  0.621479  1.551653
2  1.631438 -1.635403
3  0.473935  1.941249
4  1.904851 -0.195161
5  0.236945 -0.288274
6 -0.473348  0.403882
7  0.953940  1.718043
8 -0.289416  0.790983
9 -0.884789 -1.584088
........

An example of a query is df.query('Column1 > Column2')

Let's say you wanted to limit the size of the result of this query, so the returned object isn't so large. Is there a "pandas" way to accomplish this?

My question is primarily about querying an HDF5 object with pandas. An HDF5 file could be far larger than RAM, and therefore query results could be larger than RAM.

# file1.h5 contains only one table/key/HDF5 group called 'df'
store = pd.HDFStore('file1.h5')

# the result of the following query could be too large
df = store.select('df', columns=['column1', 'column2'], where=['column1 == 5'])

Is there a pandas/Pythonic way to stop users from executing queries that surpass a certain size?

ShanZhengYang

1 Answer


Here is a small demonstration of how to use the chunksize parameter when calling HDFStore.select():

for chunk in store.select('df', columns=['column1', 'column2'],
                          where='column1 == 5', chunksize=10**6):
    # process each `chunk` DataFrame (at most 10**6 rows in memory at a time)
    ...
MaxU - stand with Ukraine
  • This doesn't quite answer my question, but this is an approach. If I'm integrating PyTables into software whereby users query an `HDFStore`, I would like the query to proceed until it hits "too many rows"---then, it will stop and throw an error. The above is a solution if I know a priori that the query is too large, and I want to break it up. Am I explaining the problem clearly? – ShanZhengYang Oct 11 '16 at 21:22
  • @ShanZhengYang, no it's still not quite clear to me... Do you want to estimate the size of the resulting DF before reading it from the store? – MaxU - stand with Ukraine Oct 11 '16 at 21:28
  • Not necessarily, but I suspect that is the best way to do it. Let's say I try `df = store.select('df', columns=['column1', 'column2'], where=['column1==5'])` and it's larger than some limit in terms of RAM---if the limit is the limit set by the computer's hardware, the program will just freeze. Let's say I wanted to set an arbitrary limit, e.g. 4 GB. The HDF5 might be +TB or PB, so `df` could easily exceed RAM if a user were to query this object. What limitations could I put in place to stop "bad things" from happening? – ShanZhengYang Oct 11 '16 at 21:50
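
Following up on the comments: one possible way to enforce such a cap is to accumulate chunks until an arbitrary row limit is reached and abort with an error once it is exceeded. This is only a rough sketch; the cap `MAX_ROWS` is an assumed application-defined value, and the file/column names are carried over from the question.

import pandas as pd

MAX_ROWS = 10**7   # arbitrary application-defined cap (assumption)
CHUNK = 10**6      # rows fetched per iteration

store = pd.HDFStore('file1.h5')
pieces, total = [], 0
try:
    for chunk in store.select('df', columns=['column1', 'column2'],
                              where='column1 == 5', chunksize=CHUNK):
        total += len(chunk)
        if total > MAX_ROWS:
            # stop before the accumulated result can exhaust RAM
            raise MemoryError('query result exceeds the configured row limit')
        pieces.append(chunk)
    # assemble the result only if it stayed under the cap
    df = (pd.concat(pieces, ignore_index=True)
          if pieces else pd.DataFrame(columns=['column1', 'column2']))
finally:
    store.close()

A byte-based limit (e.g. the 4 GB mentioned above) could be approximated by summing `chunk.memory_usage(deep=True).sum()` instead of `len(chunk)`.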