
I would like to know whether there is a method that can read multiple lines from a file, batch by batch. For example:

with open(filename, 'rb') as f:
    for n_lines in f:  # wanted: n_lines should be the next batch of n lines
        process(n_lines)

What I would like is: on every iteration, the next n lines are read from the file, batch by batch.

A single file is too big to load at once, so I want to read it part by part.

fluency03
  • You could read all the lines with `readlines`, and then pass successive ten-line slices into `process` (sketched after these comments). – John Gordon Sep 17 '16 at 17:00
  • No, because the file is too big. What I want to do is to read it batch by batch. – fluency03 Sep 17 '16 at 17:01
  • Try `f.read(byte_size)`, where `byte_size` is the number of bytes you want to read, if that's what you want. – lycuid Sep 17 '16 at 17:03
  • I want to do it line by line, since the size of each line is not fixed and I don't know it in advance. But I always have to read entire lines, never partial ones. – fluency03 Sep 17 '16 at 17:05
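For reference, the readlines-and-slice suggestion from the first comment would look something like this; as the asker notes, it loads every line into memory at once, so it only suits files that fit in RAM:

with open(filename) as f:
    lines = f.readlines()  # reads the entire file into memory

for i in range(0, len(lines), 10):
    process(lines[i:i + 10])  # successive ten-line slices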

2 Answers

itertools.islice and the two-argument form of iter can be used to accomplish this, but it's a little funny:

from itertools import islice

n = 5  # Or whatever chunk size you want
with open(filename, 'rb') as f:
    for n_lines in iter(lambda: tuple(islice(f, n)), ()):
        process(n_lines)

This will keep islice-ing off n lines at a time (using tuple to force the whole chunk to actually be read in) until f is exhausted, at which point it will stop. The final chunk will have fewer than n lines if the number of lines in the file isn't an even multiple of n. If you want each chunk to be a single string, change the for loop to:

    # The b prefixes are ignored on 2.7, and necessary on 3.x since you opened
    # the file in binary mode
    for n_lines in iter(lambda: b''.join(islice(f, n)), b''):
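To see the chunking behaviour concretely, here is a small self-contained demo of the tuple form; the in-memory io.BytesIO object and the print call are stand-ins for a real file and process:

import io
from itertools import islice

# Stand-in for a real file: 7 lines, so the final chunk comes up short.
f = io.BytesIO(b"".join(b"line %d\n" % i for i in range(7)))

n = 3
for chunk in iter(lambda: tuple(islice(f, n)), ()):
    print(chunk)

# Output:
# (b'line 0\n', b'line 1\n', b'line 2\n')
# (b'line 3\n', b'line 4\n', b'line 5\n')
# (b'line 6\n',)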

Another approach is to use izip_longest for the purpose, which avoids lambda functions:

from future_builtins import map  # Only on Py2
from itertools import izip_longest  # zip_longest on Py3

    # gets tuples possibly padded with empty strings at end of file
    for n_lines in izip_longest(*[f]*n, fillvalue=b''):

    # Or to combine into a single string:
    for n_lines in map(b''.join, izip_longest(*[f]*n, fillvalue=b'')):
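Note that izip_longest pads the final chunk with the fill value. A Python 3 rendering with that padding stripped back out might look like this (the filtering list comprehension is an addition for illustration, not part of the original answer):

from itertools import zip_longest

n = 5
with open(filename, 'rb') as f:
    for chunk in zip_longest(*[f] * n, fillvalue=b''):
        lines = [line for line in chunk if line]  # drop the b'' padding
        process(lines)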
ShadowRanger
  • I am wondering, in the solution using `islice`, are the n lines read in one go, or are they actually read one by one and grouped together into an n-line chunk? – fluency03 Sep 19 '16 at 10:30
  • @ChangLiu: In all cases, they're read one by one, but there is block buffering occurring, so odds are there are only 0-2 reads needed for any given block. There is no magical way to read `n` lines as a single read; heck, at the lower layers, there is no way to read _one_ line as a single read, it's either buffering (fast, but overreading) or pulling a character at a time (no overread, but much slower). – ShadowRanger Sep 19 '16 at 12:09

You can actually just iterate over the lines of a file (see the file.next docs; this also works on Python 3), like

with open(filename) as f:
    for line in f:
        something(line)

so your code can be rewritten to

n = 5  # your batch size
with open(filename) as f:
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) == n:
            process(batch)
            batch = []

if batch:  # final batch may be shorter than n
    process(batch)

but normally just processing the file line by line is more convenient (first example).
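If the batching logic is needed in more than one place, it can be wrapped in a small generator; this is a sketch of the same technique as the loop above, not something from the original answer:

def batched_lines(f, n):
    """Yield lists of up to n lines from an open file object."""
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch

with open(filename) as f:
    for batch in batched_lines(f, 5):
        process(batch)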

If you don't care exactly how many lines are read in each batch, only that it does not use too much memory, then use file.readlines with a sizehint, like

size_hint = 1 << 24  # 16 MiB
with open(filename) as f:
    while True:
        lines = f.readlines(size_hint)
        if not lines:  # readlines returns an empty list at EOF
            break
        process(lines)
janbrohl
  • In this way you are actually still reading a single line at a time; what I want is to read n lines at a time. – fluency03 Sep 17 '16 at 17:37
  • There is no function for reading n lines directly - actually, because of buffering (the buffer size can be passed to `open` as an argument; see the sketch below), reading a file line by line normally costs only one disk read for many lines. – janbrohl Sep 17 '16 at 17:44
  • extended my answer to include `readlines`, which actually is a multi-line read – janbrohl Sep 17 '16 at 17:55
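To illustrate the buffering point from the comments above, the buffer size can be tuned through the buffering argument of the built-in open; the 1 MiB value below is an arbitrary example, not something from the answer:

import io

print(io.DEFAULT_BUFFER_SIZE)  # the interpreter's default buffer size, typically 8192

# A larger buffer means fewer underlying disk reads while still
# iterating one line at a time.
with open(filename, 'rb', buffering=1024 * 1024) as f:
    for line in f:
        process(line)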