
Suppose I have a large amount of data that I am loading into a DataFrame in chunks. For example: I have a table that is more than 40 GB; selecting 3 columns from it comes to maybe 2-3 GB, and there are about 10 million rows.

c = pd.read_sql("select a,b,c from table;", con=db, chunksize=10**2)
b = c['a']

Since it is reading the table chunk by chunk, does that mean it is not loading the whole 3 GB into memory at once, but operating on only 10^2 MB at a time and then moving on to the next chunk automatically?

If not, how to make it behave like this?

Viv

1 Answer


Quoting the documentation:

chunksize : int, default None
    If specified, return an iterator where chunksize is the number of rows
    to include in each chunk.

So, first of all, chunksize denotes the number of rows, not a size in MB. Providing a chunksize also has the effect that an iterator is returned instead of a DataFrame, so you need to loop over it. Given that, on the Python side you only need memory for 10^2 rows at a time.
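
For illustration, here is a minimal sketch of that loop. It uses a throwaway in-memory SQLite table as a stand-in for your 40 GB table (the table and column names are just placeholders), but the pattern is the same: each iteration yields a DataFrame of at most chunksize rows.

    import sqlite3
    import pandas as pd

    # Stand-in for the real database: a small in-memory SQLite table.
    # In your case, `db` would be the existing connection to the 40 GB table.
    db = sqlite3.connect(":memory:")
    pd.DataFrame({"a": range(1000), "b": 0, "c": 0}).to_sql("mytable", db, index=False)

    # With chunksize set, read_sql returns an iterator of DataFrames,
    # each holding at most 10**2 = 100 rows.
    for chunk in pd.read_sql("select a, b, c from mytable;", con=db, chunksize=10**2):
        b = chunk["a"]        # operate on this chunk's column only
        print(len(b))         # replace with the real per-chunk processing

Only one chunk's rows live on the Python side at any moment; the rest stay in the database until the next iteration pulls them.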

nucleon
  • Ah, so chunksize gives the row count. When you say loop over that, do we just need a for loop like: `for c in pd.read_sql("select a,b,c from table;", con=db, chunksize=10**2):`? – Viv May 26 '17 at 16:01
  • That would be a way to do it. `iter = pd.read_sql("select a,b,c from table;", con=db, chunksize=10**2)` and `for df in iter:` would be equivalent. – nucleon May 26 '17 at 16:07
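
For reference, a short sketch of the two equivalent spellings from the comments, assuming the same placeholder `db` connection and table as above (`process` is a hypothetical per-chunk function):

    def process(df):
        # Hypothetical per-chunk work; here just report the chunk size.
        print(len(df))

    # Assign the iterator first, then loop over it ...
    chunks = pd.read_sql("select a, b, c from mytable;", con=db, chunksize=10**2)
    for df in chunks:
        process(df)

    # ... which is equivalent to looping over the call directly.
    for df in pd.read_sql("select a, b, c from mytable;", con=db, chunksize=10**2):
        process(df)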