
The CSV file here is about 500MB with 2.7 million rows. Through extensive testing, I have verified that `for row in read_new:` quadruples memory consumption, and I can't understand why. The increase happens during the loop itself, not before or after the `for` statement.

Can anyone shed some light on why this is happening?

I understand there are better ways to execute this script, but I have my reasons for doing it this way. I'm just trying to figure out why this is happening and whether there is a more appropriate buffer than `io.StringIO()` for this purpose.

import io
import csv
import time

filename = 'rcs_batch_032519.csv'

csv_fob = open(filename, 'r')
fix_fob = io.StringIO()

reader = csv.reader(csv_fob)
writer = csv.writer(fix_fob)

for row in reader:
    writer.writerow(row)

fix_fob.seek(0)
read_new = csv.reader(fix_fob)

# Memory explodes here, from 634MB to 2.36GB, after executing 'for' statement
for row in read_new:
    time.sleep(30)
    pass
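
For reference, here is a minimal sketch of how the growth at each stage could be measured with the standard-library `tracemalloc` module; the harness itself (stage labels and print calls) is an assumption on my part and not part of the original script:

import io
import csv
import tracemalloc

filename = 'rcs_batch_032519.csv'

tracemalloc.start()

# Stage 1: copy the file into the StringIO buffer, as in the original script.
with open(filename, 'r') as csv_fob:
    fix_fob = io.StringIO()
    writer = csv.writer(fix_fob)
    for row in csv.reader(csv_fob):
        writer.writerow(row)

# tracemalloc reports allocations made through Python's memory allocators,
# which should include the StringIO buffer's internal storage.
print('after copy into StringIO:', tracemalloc.get_traced_memory())

# Stage 2: rewind, as in the original script.
fix_fob.seek(0)
print('after seek(0):           ', tracemalloc.get_traced_memory())

# Stage 3: the second read, where the question reports the jump.
for row in csv.reader(fix_fob):
    pass

print('after second read:       ', tracemalloc.get_traced_memory())

`tracemalloc.get_traced_memory()` returns a `(current, peak)` tuple in bytes, so the difference between the seek and second-read stages corresponds to the span the question is asking about.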
  • If you're going to read the entire file into memory, why don't you read it into a list of lists instead so that the data is better structured and more efficient since you don't have to rely on another instance of `csv.reader` to parse the CSV content again? – blhsing Apr 15 '19 at 16:58
  • @blhsing I will probably end up doing that at this point, but I'm also really curious about this observed behavior. – OnNIX Apr 15 '19 at 17:00
  • @OnNIX: At this point (*"executing 'for' statement"*) you have your data **three times** in memory. Compute 634MB * 3 + overhead – stovfl Apr 15 '19 at 19:40
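
A minimal sketch of the list-of-lists approach blhsing suggests in the comment above; the variable names are placeholders, and whether 2.7 million parsed rows fit comfortably in memory depends on the data:

import csv

filename = 'rcs_batch_032519.csv'

# Parse the file once and keep the already-split rows, so no second
# csv.reader pass over a StringIO copy is needed.
with open(filename, 'r', newline='') as csv_fob:
    rows = list(csv.reader(csv_fob))  # list of lists of strings

# Later passes iterate over the parsed rows directly.
for row in rows:
    pass

Because the rows are parsed once and kept as Python lists, the data is only held in one structure rather than both a text buffer and a second parsing pass.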

0 Answers