Processing dataframe in chunks

Question

I need to process a large dataframe in chunks, and I am using this function:

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, chunk_size):
    ....

However, when I run this, I get an error: ValueError: invalid literal for int() with base 10:

Is there another way to process a dataframe in chunks, or a way to adjust the script above?

Thanks!

score 1 · Answer 1 · answered May 03 '22 at 14:37

You need to use iloc for this index slicing over the rows:

def chunker(seq, size):
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, chunk_size):
    ....

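The reason is that df[] is mainly for looking up columns; given a slice it selects rows instead, which is easy to misread, so it is best avoided for chunking. df.loc is for row lookups by index label, and labels do not necessarily match incremental (position-based) indexing. df.iloc always slices by position. You can read this for a more detailed explanation.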
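As a minimal sketch of the iloc version in action: the sample dataframe below, with its string index and the column name "value", is made up for illustration and is not taken from the question. It only shows that the chunks come out by position regardless of the index labels.

```python
import pandas as pd


def chunker(seq, size):
    # Yield consecutive position-based row slices of at most `size` rows.
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))


# Hypothetical example data: a string index, so labels and positions differ.
df = pd.DataFrame({"value": range(7)}, index=list("abcdefg"))

chunks = list(chunker(df, 3))
chunk_lengths = [len(chunk) for chunk in chunks]      # rows per chunk
first_labels = [chunk.index[0] for chunk in chunks]   # first row label of each chunk
```

The last chunk is simply shorter when the number of rows is not a multiple of the chunk size, and concatenating the chunks gives back the original dataframe.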

score 0 · Answer 2 · answered May 03 '22 at 15:23

Thanks for your quick answer. I tried again following your recommendation, but I still get the error:

ValueError: invalid literal for int() with base 10: 'xxx'

If I run the same code on a different dataframe, built using np.random.randn, it works fine. Also, the error does not occur if I do not divide the dataframe into chunks. Any clue?

Nic_bar

I understand your reply is addressing mine. Can you share more of your code? I fail to see what could not be an integer in your example. Where do you set the value of `chunk_size`, for instance? – Learning is a mess May 03 '22 at 16:08
