I am going to write a Python program that reads chunks from a file, processes those chunks, and then appends the processed data to a new file. It needs to read in chunks because the files to process will generally be larger than the available RAM. Greatly simplified, it will be something like this:
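def read_chunk(file_object, chunksize):
    # Read the next chunk from the file object and return it.
    # Plain file I/O is a stand-in here; an empty result means end of file.
    return file_object.read(chunksize)

def process_chunk(chunk):
    # Process the chunk and return the processed data (placeholder: pass-through).
    data = chunk
    return data

def write_chunk(data, outputfile):
    # Write the data to the output file.
    outputfile.write(data)

def main(in_path, out_path, chunksize=1024 * 1024):
    # This will do the work: read, process and write one chunk at a time.
    with open(in_path, "rb") as file_obj, open(out_path, "ab") as out_file:
        while True:
            chunk = read_chunk(file_obj, chunksize)
            if not chunk:
                break
            data = process_chunk(chunk)
            write_chunk(data, out_file)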
What I'm wondering is whether I can execute these three functions concurrently, and how that would work.
That is, one thread to read the data, one thread to process it, and one thread to write it. Of course, the reading thread would always need to be one 'step' ahead of the processing thread, which in turn needs to be one step ahead of the writing thread.
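One possible shape for this three-thread arrangement, as a minimal sketch rather than a finished design: each stage runs in its own thread and hands its output to the next stage through a bounded `queue.Queue`, so the reader can only get a fixed number of chunks ahead. The names `run_pipeline`, `reader`, `processor`, `writer`, `_DONE` and the `depth` parameter are made up for illustration, and plain file objects stand in for the real raster I/O. There is no error handling: an exception in one stage would leave the others blocked.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def reader(file_obj, chunksize, out_q):
    # Read chunks until end of file; put() blocks while the queue is full,
    # which keeps the reader only a few steps ahead of the processor.
    while True:
        chunk = file_obj.read(chunksize)
        if not chunk:
            break
        out_q.put(chunk)
    out_q.put(_DONE)


def processor(in_q, out_q, process_chunk):
    # Take raw chunks, process them, and pass the results downstream.
    while True:
        chunk = in_q.get()
        if chunk is _DONE:
            break
        out_q.put(process_chunk(chunk))
    out_q.put(_DONE)


def writer(in_q, out_file):
    # Write processed chunks in the order they arrive.
    while True:
        data = in_q.get()
        if data is _DONE:
            break
        out_file.write(data)


def run_pipeline(in_file, out_file, process_chunk, chunksize=1 << 20, depth=2):
    # depth bounds how many chunks may wait between two stages.
    raw_q = queue.Queue(maxsize=depth)
    done_q = queue.Queue(maxsize=depth)
    threads = [
        threading.Thread(target=reader, args=(in_file, chunksize, raw_q)),
        threading.Thread(target=processor, args=(raw_q, done_q, process_chunk)),
        threading.Thread(target=writer, args=(done_q, out_file)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each stage is a single thread fed by a FIFO queue, chunks are written in the order they were read. In standard CPython, threads do not run Python bytecode in parallel, so this mainly overlaps I/O with processing; it only uses more than one core if the processing step releases the GIL.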
What would be really great would be to execute it concurrently and split the work among processors.
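For spreading the processing step across processors, one option is a process pool from `concurrent.futures`. The sketch below is an assumption about how that could look, not a tested design for raster data: `map_ordered_bounded`, `iter_chunks` and the byte-inverting `process_chunk` are hypothetical names and placeholder work. The helper submits chunks to an executor, keeps only a limited number in flight so the whole file is never held in memory, and yields results in input order so they can be appended to the output as they arrive.

```python
from collections import deque
from concurrent.futures import ProcessPoolExecutor


def process_chunk(chunk):
    # Placeholder CPU-bound work (hypothetical): invert every byte.
    return bytes(255 - b for b in chunk)


def iter_chunks(file_obj, chunksize):
    # Yield successive chunks until the file is exhausted.
    while True:
        chunk = file_obj.read(chunksize)
        if not chunk:
            return
        yield chunk


def map_ordered_bounded(executor, func, items, max_in_flight=4):
    # Yield func(item) in input order while keeping at most
    # max_in_flight submitted tasks pending at any time.
    pending = deque()
    for item in items:
        pending.append(executor.submit(func, item))
        if len(pending) >= max_in_flight:
            yield pending.popleft().result()
    while pending:
        yield pending.popleft().result()


def main(in_path, out_path, chunksize=1 << 20, workers=4):
    # Call this from under an `if __name__ == "__main__":` guard, which
    # process pools need on platforms that start workers by spawning.
    with open(in_path, "rb") as src, open(out_path, "ab") as dst, \
            ProcessPoolExecutor(max_workers=workers) as pool:
        chunks = iter_chunks(src, chunksize)
        for data in map_ordered_bounded(pool, process_chunk, chunks, 2 * workers):
            dst.write(data)
```

The helper accepts any `concurrent.futures` executor, so a `ThreadPoolExecutor` can be swapped in if the processing turns out to release the GIL. With processes, every chunk is pickled on its way to and from the worker, which adds overhead for large arrays.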
More detail on the exact problem: I'll be reading data from a raster file using the GDAL library. This will read chunks/lines into a numpy array. The processing will simply be some logical comparisons between the value of each raster cell and its neighbours (which neighbours have a lower value than the test cell, and which of those is the lowest). A new array of the same size (edges are assigned arbitrary values) will be created to hold the result, and this array will be written to a new raster file. I anticipate that the only library needed other than GDAL will be numpy, which could also make this routine a good candidate for 'cythonising'.
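To make the neighbour comparison concrete, here is a rough numpy sketch of one way it could be vectorised. Everything in it is an assumption for illustration: the function name `lowest_neighbour`, the direction codes 0 to 7 taken from the order of `OFFSETS`, and the use of -1 both for edge cells and for cells with no lower neighbour. It works on an in-memory 2-D array and does not touch GDAL.

```python
import numpy as np

# (row, col) offsets of the 8 neighbours; the position in this list is
# the direction code stored in the result (an arbitrary convention).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def lowest_neighbour(block, edge_value=-1, none_value=-1):
    # For every interior cell, return the code of its lowest neighbour if
    # that neighbour is lower than the cell itself, else none_value.
    # Edge cells are filled with edge_value.
    rows, cols = block.shape
    result = np.full((rows, cols), edge_value, dtype=np.int8)
    if rows < 3 or cols < 3:
        return result  # no interior cells to compare
    centre = block[1:-1, 1:-1]
    # neighbours[k] holds, for each interior cell, the value of neighbour k.
    neighbours = np.stack([
        block[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
        for dr, dc in OFFSETS
    ])
    lowest = neighbours.argmin(axis=0)
    has_lower = neighbours.min(axis=0) < centre
    result[1:-1, 1:-1] = np.where(has_lower, lowest, none_value)
    return result
```

When the raster is processed in chunks, consecutive chunks would need to overlap by one row (or column) so that cells on a chunk boundary can still see all of their neighbours; otherwise those cells end up treated as edges.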
Any tips on how to proceed?
Edit:
I should point out that we have implemented similar things previously, and we know that the time spent processing will be significant compared to I/O. Another point is that the library we will use for reading the files (GDAL) will support concurrent reading...