I need to process a large dataframe in chunks and I applied this function:

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, chunk_size):
    ....

However, when I run this, I get the following error: ValueError: invalid literal for int() with base 10:

Is there another way to process a dataframe in chunks, or a way to adjust the script above?

Thanks!

Nic_bar

2 Answers

You need to use iloc to slice the rows by position:

def chunker(seq, size):
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, chunk_size):
    ....

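The reason is that df[] is primarily for looking up columns; it does accept a plain slice and then slices rows, but that is a special case that is easy to misread. df.iloc always selects rows by integer position, which is what chunking needs. df.loc is for row-label lookups, and labels do not necessarily match positions. You can read this for a more detailed explanation.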
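As a minimal sketch of the position-based slicing described above, the example below runs the same chunker on a small made-up dataframe whose index holds letters rather than integers. The column name `value`, the letter index and the chunk size of 3 are assumptions chosen only for illustration.

```python
import pandas as pd

def chunker(seq, size):
    # Yield successive positional slices of at most `size` rows.
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))

# Hypothetical frame with a string index, so labels and positions differ.
df = pd.DataFrame({"value": range(7)}, index=list("abcdefg"))

chunks = list(chunker(df, 3))
chunk_lengths = [len(chunk) for chunk in chunks]      # rows per chunk
first_labels = [chunk.index[0] for chunk in chunks]   # first row label of each chunk
print(chunk_lengths, first_labels)
```

The last chunk is simply shorter when the number of rows is not a multiple of the chunk size, and concatenating the chunks gives back the original dataframe.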

Learning is a mess

Thanks for your quick answer. I tried again following your recommendation, but I still get the error:

ValueError: invalid literal for int() with base 10: 'xxx'

If I run the same code on a different dataframe, built using np.random.randn, it works fine. Also, the error does not occur if I do not divide the dataframe into chunks. Any clue?
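One way to narrow this down, assuming the failure comes from a column that holds text such as 'xxx' where the per-chunk code expects integers, is to list the rows that cannot be parsed as numbers. The helper name `non_numeric_rows`, the column name `count` and the sample values are hypothetical; the real cause may be elsewhere in the code that is not shown.

```python
import pandas as pd

def non_numeric_rows(frame, column):
    # Coerce the column to numbers; entries that cannot be parsed become NaN.
    coerced = pd.to_numeric(frame[column], errors="coerce")
    # Keep rows that had a value originally but failed to parse.
    return frame[coerced.isna() & frame[column].notna()]

# Hypothetical data: one entry is text instead of a number.
df = pd.DataFrame({"count": ["1", "2", "xxx", "4"]})

bad = non_numeric_rows(df, "count")
bad_positions = list(bad.index)
print(bad)
```

A dataframe built with np.random.randn contains only floats, so a check like this would return no rows for it, which would be consistent with that case working.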

Nic_bar
    I understand your reply is addressing mine. Can you share more of your code? I fail to see what could be a non-integer in your example. Where do you set the value of `chunk_size`, for instance? – Learning is a mess May 03 '22 at 16:08