[newbie question]

Hi,

I'm working with a huge text file that is well over 30GB.

I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop over it with "for", my computer crashes and shows a blue screen after about 10% of the data has been processed.

I'm currently using this:

f = open(file_path,'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()

Also, how can I show the overall progress of how much data has been crunched so far?

Thank you all very much.

Raj K

3 Answers

File handles are iterable, and you should probably use a context manager. Try this:

with open(file_path, 'r') as fh:
    for line in fh:
        process(line)

That might be enough: iterating over the handle reads one line at a time rather than loading the whole file into memory, and the with block closes the file automatically, even if an exception is raised.
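
As a rough sketch of how that fits your description (do_some_processing is your function; write_record_to_db is a hypothetical stand-in for your database layer, and I'm assuming the processed result is JSON-serializable):

import json

with open(file_path, 'r') as fh:
    for line in fh:
        result = do_some_processing(line)
        # write_record_to_db is a placeholder for your actual db write
        write_record_to_db(json.dumps(result))

Only the current line is held in memory at any point, so the 30GB file is never loaded all at once.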

g.d.d.c
  • @Dhaivat - I don't think you understand what this code is doing; it really is quite efficient. Much better than reading the entire file contents at once, as `read()` or `readlines()` would do. – PaulMcG May 27 '11 at 02:02
  • @Dhaivat - Out of curiosity, what part of it are you considering inefficient? It carries a number of advantages (whole file not in memory, error handling, file handle closes automatically) without any real downside I can see. – g.d.d.c May 27 '11 at 17:44
  • Oops, this is very embarrassing. I commented on the wrong answer. – Dhaivat Pandya May 27 '11 at 17:59
  • @Dhaivat - No problem. Glad you clarified though. :) – g.d.d.c May 27 '11 at 18:03

I used a function like this for a similar problem. You can wrap any iterable with it.

Change this:

for one_line in f.readlines():

to this:

# Don't use readlines(); it builds a big list of the entire file in memory
# instead of iterating one line at a time.
for one_line in progress_meter(f, 10000):

You might want to pick a smaller or larger chunk size depending on how much time you want to spend printing status messages.

import time

def progress_meter(iterable, chunksize):
    """Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print idx
            print 'avg rate', idx / (time.time() - scan_start)
            print 'inst rate', chunksize / (time.time() - since_last)
            since_last = time.time()
            print
        yield val
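
A minimal usage sketch, borrowing the context-manager style from the other answer (the 10000 is an arbitrary interval; do_some_processing is from the question):

with open(file_path, 'r') as f:
    for one_line in progress_meter(f, 10000):
        do_some_processing(one_line)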
Rob Neuhaus
  • He also thinks readline() iterates over lines rather than over the characters in a line, so I am pretty sure it is a typo. – Rob Neuhaus May 26 '11 at 22:51

Using readline requires finding the end of each line in your file. If some lines are very long, that can crash your interpreter (not enough memory to buffer the whole line).

To show progress, you can check the total file size, for example with:

import os

f = open(file_path, 'r')
fsize = os.fstat(f.fileno()).st_size

The progress of your task is then the number of bytes processed so far, divided by the file size and multiplied by 100 to give a percentage.
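
A small sketch of how that calculation could look in your loop (do_some_processing is from the question; len(one_line) is summed instead of calling f.tell(), since tell() reports the read-ahead buffer position while you iterate):

import os

with open(file_path, 'r') as f:
    fsize = os.fstat(f.fileno()).st_size
    bytes_done = 0
    for one_line in f:
        do_some_processing(one_line)
        bytes_done += len(one_line)
        # Print roughly every 100 MB of input processed.
        if bytes_done % (100 * 1024 * 1024) < len(one_line):
            print '%.1f%% done' % (bytes_done * 100.0 / fsize)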

BP8467