
I'm reading a large CSV file in chunks using chunksize (pandas DataFrame), like so:

import pandas as pd

reader = pd.read_csv('log_file.csv', low_memory=False, chunksize=4e7)

I know I could just calculate the number of chunks it will take to read the file, but I would like to determine it automatically and save the number of chunks in a variable, like so (in pseudocode):

number_of_chunks = countChunks(reader)

Any ideas?


1 Answer


You can use a generator expression to iterate through reader (the TextFileReader returned by read_csv when chunksize is specified) and add 1 for each chunk:

number_of_chunks = sum(1 for chunk in reader)
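
Note that iterating over the reader this way consumes it, so you cannot reuse the same reader afterwards to read the data. If you also need to process the chunks, a minimal sketch (reusing the file name and chunk size from the question) is to count while you process:

import pandas as pd

reader = pd.read_csv('log_file.csv', low_memory=False, chunksize=int(4e7))

number_of_chunks = 0  # stays 0 if the file yields no chunks
for number_of_chunks, chunk in enumerate(reader, start=1):
    ...  # process each chunk (a DataFrame) here

print(number_of_chunks)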

Alternatively, you can use a generator expression to count the number of lines in the file (the same idea as the first option, but iterating over the lines of the file), then divide that number by the chunk size and round the result up (with math.ceil):

import math

chunksize = 4e7  # the same chunk size passed to read_csv
with open('log_file.csv', 'r') as f:
    number_of_rows = sum(1 for row in f)  # note: this count includes the header line, if there is one
number_of_chunks = math.ceil(number_of_rows / chunksize)

or, as a one-liner:

import math
number_of_chunks = math.ceil(sum(1 for row in open('log_file.csv', 'r')) / chunksize)

In my tests, the second approach performed better than the first.
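
If you want to reproduce the comparison on your own data, here is a minimal timing sketch, assuming the file name and chunk size from the question:

import math
import time

import pandas as pd

chunksize = int(4e7)  # chunk size from the question

# Option 1: iterate the TextFileReader and count the chunks
start = time.perf_counter()
reader = pd.read_csv('log_file.csv', low_memory=False, chunksize=chunksize)
chunks_v1 = sum(1 for chunk in reader)
t1 = time.perf_counter() - start

# Option 2: count the lines of the file and divide by the chunk size
# (the raw line count includes the header, so the result can be off by one
# when the number of data rows is an exact multiple of chunksize)
start = time.perf_counter()
with open('log_file.csv', 'r') as f:
    chunks_v2 = math.ceil(sum(1 for row in f) / chunksize)
t2 = time.perf_counter() - start

print(f"iterating the reader: {chunks_v1} chunks in {t1:.2f} s")
print(f"counting lines:       {chunks_v2} chunks in {t2:.2f} s")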
