
I am going to write a Python program that reads chunks from a file, processes those chunks and then appends the processed data to a new file. It will need to read in chunks, as the files to process will generally be larger than the amount of RAM available. Greatly simplified, the pseudocode will be something like this:

def read_chunk(file_object, chunksize):
    # read the data from the file object and return the chunk
    return chunk

def process_chunk(chunk):
    #process the chunk and return the processed data
    return data

def write_chunk(data, outputfile):
    # write the data to the output file
    pass

def main(file_obj, out_file, chunksize):
    # This will do the work
    for i in range(numberofchunks):
        chunk = read_chunk(file_obj, chunksize)
        data = process_chunk(chunk)
        write_chunk(data, out_file)

What I'm wondering is: can I execute these 3 methods concurrently, and how would that work?

I.e. one thread to read the data, one thread to process the data and one thread to write the data. Of course, the reading thread would always need to be one 'step' ahead of the processing thread, which needs to be one step ahead of the writing thread...

What would be really great would be to be able to execute it concurrently and split it among processors...
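
Very roughly, this is the shape I'm imagining: one thread per stage connected by bounded queues so the reader stays only a 'step' or two ahead. This is only a sketch, reusing the stub functions from the pseudocode above and assuming read_chunk returns something falsy at end of file:

import threading
import queue

SENTINEL = object()   # marks end of stream

def reader(file_obj, chunksize, out_q):
    # keep reading until read_chunk signals end of file (assumed falsy)
    while True:
        chunk = read_chunk(file_obj, chunksize)
        if not chunk:
            break
        out_q.put(chunk)
    out_q.put(SENTINEL)

def processor(in_q, out_q):
    while True:
        chunk = in_q.get()
        if chunk is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(process_chunk(chunk))

def writer(in_q, out_file):
    while True:
        data = in_q.get()
        if data is SENTINEL:
            break
        write_chunk(data, out_file)

def main_threaded(file_obj, out_file, chunksize):
    # bounded queues keep the reader only a couple of 'steps' ahead
    read_q = queue.Queue(maxsize=2)
    write_q = queue.Queue(maxsize=2)
    threads = [
        threading.Thread(target=reader, args=(file_obj, chunksize, read_q)),
        threading.Thread(target=processor, args=(read_q, write_q)),
        threading.Thread(target=writer, args=(write_q, out_file)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()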

More detail on the exact problem: I'll be reading data from a raster file using the GDAL library. This will read in chunks/lines into a numpy array. The processing will simply be some logical comparisons between the value of each raster cell and its neighbours (which neighbour has a lower value than the test cell and which of those is the lowest). A new array of the same size (edges are assigned arbitrary values) will be created to hold the result and this array will be written to a new raster file. I anticipate that the only library other than GDAL will be numpy, which could make this routine a good candidate for 'cythonising' as well.
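
To give an idea of the processing step, here is a made-up, unoptimised numpy sketch; the lowest_neighbour name, the 0-7 neighbour codes and the -1 edge value are invented purely for illustration:

import numpy as np

def lowest_neighbour(block, edge_value=-1):
    # For each interior cell, return a code 0-7 identifying the lowest of its
    # 8 neighbours that is lower than the cell itself, or edge_value if no
    # neighbour is lower. Edge rows/columns just get edge_value.
    out = np.full(block.shape, edge_value, dtype=np.int8)
    centre = block[1:-1, 1:-1]
    lowest = np.full(centre.shape, np.inf)
    code = np.full(centre.shape, edge_value, dtype=np.int8)
    # the 8 neighbour offsets, in a fixed order
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    for k, (dr, dc) in enumerate(offsets):
        neigh = block[1 + dr:block.shape[0] - 1 + dr,
                      1 + dc:block.shape[1] - 1 + dc]
        better = (neigh < centre) & (neigh < lowest)
        lowest = np.where(better, neigh, lowest)
        code = np.where(better, k, code)
    out[1:-1, 1:-1] = code
    return out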

Any tips on how to proceed?

Edit:

I should point out that we have implemented similar things previously, and we know that the time spent processing will be significant compared to I/O. Another point is that the library we will use for reading the files (GDAL) will support concurrent reading...
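
Given that, something along these lines is what I'm considering for spreading the processing across processors. Again only a sketch reusing the stubs from the pseudocode above; process_chunk would have to be a top-level, picklable function, and shipping numpy arrays between processes has its own overhead:

from multiprocessing import Pool

def iter_chunks(file_obj, chunksize):
    # generator over the input chunks, using the read_chunk stub from above
    while True:
        chunk = read_chunk(file_obj, chunksize)
        if not chunk:
            return
        yield chunk

def main_parallel(file_obj, out_file, chunksize, workers=4):
    with Pool(processes=workers) as pool:
        # imap yields results in input order, so writing stays sequential
        for data in pool.imap(process_chunk, iter_chunks(file_obj, chunksize)):
            write_chunk(data, out_file)

# main_parallel would be called under an `if __name__ == '__main__':` guard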

jramm
  • There seem to be literally hundreds of questions asking about exactly this issue, and it would seem every single one is answered with "you don't need to do that, do this instead". Did you ever end up implementing this? Could you create a community wiki/answer? – MB. Dec 13 '17 at 05:04

2 Answers


Coroutines for handling data pipelines? This template should get you started, in a way that minimizes the memory profile. You could add queueing and a virtual 'fake-thread' manager to this, for multiple files.

#!/usr/bin/env python3

import time
from functools import wraps, partial, reduce

def coroutine(func_gen):
    @wraps(func_gen)
    def starter(*args, **kwargs):
        cr = func_gen(*args, **kwargs)
        _ = next(cr)
        return cr
    return starter


def read_chunk(file_object, chunksize, target):
    """
    Read an endless stream from anything with a .read method and push each
    chunk into the target coroutine. This is the source that drives the
    pipeline; it contains no yield, so it is not decorated with @coroutine.
    """
    while True:
        buf = file_object.read(chunksize)
        if not buf:
            time.sleep(1.0)
            continue
        target.send(buf)

@coroutine
def process_chunk(target):
    def example_process(thing):
        k = range(100000000) # waste time and memory
        _ = [None for _ in k]
        value = str(type(thing))
        print("%s -> %s" % (thing, value))
        return thing

    while True:
        chunk = (yield)
        data  = example_process(chunk)
        target.send(data)

@coroutine
def write_chunk(file_object):
    while True:
        writable = (yield)
        file_object.write(writable)
        file_object.flush()


def main(src, dst):
    r = open(src, 'rb')
    w = open(dst, 'wb')

    # Build the pipeline right to left: read_chunk -> process_chunk -> write_chunk.
    # read_chunk drives the whole thing and never returns, so this call blocks.
    reduce(lambda a, b: b(a),
           [w, write_chunk, process_chunk,
            partial(read_chunk, r, 16)]
          )

main("./stackoverflow.py", "retval.py")
probinso

My honest advice is not to worry about optimization right now (see premature optimization).

Unless you'll be doing a lot of operations (it doesn't seem so from your description), there's a very high chance that the I/O waiting time will be much, much larger than the processing time (i.e.: I/O is the bottleneck).

In that situation, doing the processing in multiple threads is useless. Splitting I/O and processing (as you suggested) will save you at most n*proc_time, where n is the number of chunks you process and proc_time is the processing time of each chunk (not including I/O). If proc_time is much lower than the I/O time, you won't gain much.

I'd implement this serially at first, check I/O and CPU usage, and only then worry about optimization, if it seems it might be advantageous. You may also want to experiment with reading more chunks from the file at once (buffering).
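
For example, a very rough way to see where the time goes in the serial version (just a sketch, timing each stage with time.perf_counter and reusing the stub functions from your pseudocode):

import time

def main_timed(file_obj, out_file, chunksize):
    io_time = cpu_time = 0.0
    while True:
        t0 = time.perf_counter()
        chunk = read_chunk(file_obj, chunksize)
        io_time += time.perf_counter() - t0
        if not chunk:
            break

        t0 = time.perf_counter()
        data = process_chunk(chunk)
        cpu_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        write_chunk(data, out_file)
        io_time += time.perf_counter() - t0

    print("I/O: %.1f s, processing: %.1f s" % (io_time, cpu_time))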

loopbackbee
  • No offence, but I didn't really ask for advice on working methods. To be fair to you, I didn't specify whether I was sure processing would be the bottleneck, but I always work on the 'poster knows why he is asking the question' assumption... Secondly, even if I/O was the bottleneck, having concurrent reading/writing would still give good performance gains over reading and then writing. The question wasn't about splitting just the processing across multiple threads, but having a separate thread for reading, processing and writing, so that they may run concurrently (rather than sequentially). – jramm Mar 19 '14 at 12:14
  • @jramm Indeed, I assumed you didn't know in advance what kind of bottlenecks you'd be hitting. If you're sure it's worth the trouble, you can simply do each task in its own thread (I'm not sure what further tips I can offer). My personal experience is that concurrent reading and writing to the same (spinning) disk results in worse performance because of disk seeks, and [there's other people saying the same](https://mail.python.org/pipermail/python-list/2012-March/621851.html), but YMMV, of course. – loopbackbee Mar 19 '14 at 13:34
  • Another point on the same thing... I've seen a few references that state that Python threading is ideal for I/O-bound computation (the GIL is released on blocking I/O)... so it seems that threading (rather than multiprocessing) would be the answer even if I/O is the bottleneck... – jramm Mar 19 '14 at 15:33
  • @jramm as you mention, if you're doing any kind of blocking system calls, like I/O or network requests, the Python threading model is sufficient, since there will be several threads in the waiting state and only one in the active state. The GIL only prevents two threads from being active at the same time. – loopbackbee Mar 19 '14 at 15:51