
I am currently working on my first Python project and I need to parse a 2GB file. I've found that reading it line by line is very, very slow. However, the buffering approach, using:

f = open(filename)
lines = 0
buf_size = 1024 * 1024
read_f = f.read
buf = read_f(buf_size)
while buf:
    for line in buf:
        # code for string search
        print line
    buf = read_f(buf_size)

Here `print line` doesn't print a "line"; it prints a single character at a time. So I am having problems doing a substring find on it... Please help!

Mojing Liu
    The "for line" thing works with files because the file iterator is built to break input into lines. The string iterator you have here is built to break strings into characters. You'll get better performance with a larger file buffer but I can't make any promises about how much! Go back to iterating the file line by line and try a 128K buffer `open(filename, "r", 128*1024)`. – tdelaney Oct 03 '13 at 15:53
  • Note: you can use [`iter(callable, sentinel)`](http://docs.python.org/3/library/functions.html#iter) to avoid the `while` loop: `for chunk in iter(lambda: f.read(1024 * 1024), ''): #search the substring`. In this case `iter` creates an iterator that calls its `callable` argument (i.e. does `callable()`) until the `sentinel` value is returned (see the sketch below). Anyway, reading a 2GB file *will* take some time. Assuming your hard disk can be read at 200 MB/s, it will take 10 seconds *at the least*, and I believe HDDs usually read at between 50 and 150 MB/s! – Bakuriu Oct 03 '13 at 18:09
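
A minimal, self-contained sketch of the `iter(callable, sentinel)` pattern from the comment above; the file name `"big.txt"` and the substring `"needle"` are placeholders:

# Read in 1 MB chunks until f.read() returns '' (the sentinel).
with open("big.txt") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), ''):
        if "needle" in chunk:
            print("found a match in this chunk")
        # Caveat: a match straddling two chunks is missed by this approach.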

2 Answers


`print line` prints a character because `buf` is a string, and iterating over a string yields the characters of the string as 1-character strings.

When you say that reading line by line was slow, how did you implement the read? If you were using `readlines()`, that would explain the slowness (see http://stupidpythonideas.blogspot.com/2013/06/readlines-considered-silly.html).

Files are iterable over their lines, and Python will pick a buffer size when iterating, so this might suit your needs:

for line in f:
    # do search stuff
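
For example, a self-contained version of that loop; the file name `"big.txt"` and the search string `"needle"` are placeholders:

# Let Python handle the buffering and hand back one line at a time.
with open("big.txt") as f:
    for line in f:
        if "needle" in line:
            print(line)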

If you want to specify the buffer size manually, you could also do this:

buffersize = 1024 * 1024  # for example, read roughly 1 MB worth of lines per call
buf = f.readlines(buffersize)
while buf:
    for line in buf:
        # do search stuff
    buf = f.readlines(buffersize)

That said, the first of the two is usually better.

Cookyt
  • Thanks, looks like I misunderstood what `buf` was. If I do `for line in f:`, iterating a 2GB file takes me around 2 minutes. Can this be reduced even more? – Mojing Liu Oct 03 '13 at 16:18
  • If you don't mind throwing memory to the wind, you can mmap the file (see http://stackoverflow.com/questions/8151684/how-to-read-lines-from-mmap-file-in-python); a rough sketch follows below. Other than that, you can try manually varying the buffer size. – Cookyt Oct 03 '13 at 16:56
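
A rough sketch of the mmap approach from the comment above, written Python 2 style to match the question (`"big.txt"` and `"needle"` are placeholders; Python 3 would need mode `"rb"` and a bytes pattern such as `b"needle"`):

import mmap

# Map the whole file read-only; the OS pages it in on demand.
with open("big.txt", "r") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        print(mm.find("needle"))  # byte offset of the first match, or -1
    finally:
        mm.close()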

The problem is that `buf` is a string...

Say `buf = "abcd"`.

That means `buf[0] == "a"`, `buf[1] == "b"`, and so on.

for line in buf:
    print line

would print `a`, `b`, `c` and `d`, each on its own line.

That means that in your for loop you do not loop over "lines" but over all the characters of the `buf` string. You may use `readlines`, or split your buffer into single lines by looking for `"\n"` (a sketch of the latter follows below).
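
For instance, here is one way to split fixed-size chunks into lines while carrying a line that straddles two chunks over to the next one (`"big.txt"` and `"needle"` are placeholders):

leftover = ""
with open("big.txt") as f:
    buf = f.read(1024 * 1024)
    while buf:
        parts = (leftover + buf).split("\n")
        leftover = parts.pop()  # possibly incomplete last line; finish it next chunk
        for line in parts:
            if "needle" in line:
                print(line)
        buf = f.read(1024 * 1024)
# The file may not end with a newline, so check the remainder too.
if "needle" in leftover:
    print(leftover)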

PhillipD