import tarfile
from io import BytesIO

unique_keys = ['1:bigstringhere...:5'] * 5000

file_out = BytesIO()
tar = tarfile.open(mode='w:bz2', fileobj=file_out)
for k in unique_keys:
    id, mydata, s_index = k.split(':')
    inner_fname = '%s_%s.data' % (id, s_index)
    info = tarfile.TarInfo(inner_fname)
    data = mydata.encode('utf-8')  # tar member payloads are bytes
    info.size = len(data)
    tar.addfile(info, BytesIO(data))
tar.close()

I would like to run the above loop in parallel, adding the entries to the tarfile (`tar`) concurrently for faster execution.

Any ideas?

Giorgos Komnino
  • why not simply create the tar in another thread using [`threading`](http://docs.python.org/2/library/threading.html)? [Here's a good overview](http://softwareramblings.com/2008/06/running-functions-as-threads-in-python.html) of the technique – loopbackbee Oct 15 '13 at 10:20
  • @goncalopp What do you mean by creating the tar in another thread? In the above code the "expensive" operation is `tar.addfile`. Can you give me an example of what you mean? Thanks – Giorgos Komnino Oct 15 '13 at 10:26
  • While the expensive operation is `tar.addfile`, it's just cleaner to open, write and close the file in another thread. If you didn't, you would need to [`join`](http://docs.python.org/2/library/threading.html#threading.Thread.join) the thread before closing, effectively killing the benefits of doing parallel work. All you need to do is define a new function that takes your data as an argument and opens, writes and closes the tarfile. Then just execute that function in another thread, as in the link I mentioned earlier (a sketch of this follows these comments) – loopbackbee Oct 15 '13 at 12:40
  • @goncalopp Thanks for trying to help, but I think you do not understand the problem. Here we have a for loop which is executed 5000 times and appends `mydata` to the tar file. I want to append to the file in parallel. The optimal would be to have 5000 threads, each taking one of the unique keys and adding it to the file. Hope the problem is clearer now. – Giorgos Komnino Oct 16 '13 at 09:07
  • 1
    I don't think writing to the file concurrently is a [very good idea](http://en.wikipedia.org/wiki/Race_condition#File_systems). It probably won't be any faster, anyway, since the bottleneck should be the disk, not the processor. Did you manage to do it in a single thread? – loopbackbee Oct 16 '13 at 14:09
  • `file_out = BytesIO()` is not a file, it is a byte stream. If I need to do it faster, I think I need a custom data structure. – Giorgos Komnino Oct 17 '13 at 07:13
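
A minimal sketch of what loopbackbee suggests in the comments above; the `build_tar` helper and the `out.tar.bz2` path are hypothetical names, and the archive is opened, written and closed entirely inside one background thread:

import tarfile
import threading
from io import BytesIO

unique_keys = ['1:bigstringhere...:5'] * 5000

def build_tar(keys, out_path):
    # open, write and close the tarfile entirely inside this thread
    with tarfile.open(out_path, mode='w:bz2') as tar:
        for k in keys:
            id, mydata, s_index = k.split(':')
            info = tarfile.TarInfo('%s_%s.data' % (id, s_index))
            data = mydata.encode('utf-8')
            info.size = len(data)
            tar.addfile(info, BytesIO(data))

t = threading.Thread(target=build_tar, args=(unique_keys, 'out.tar.bz2'))
t.start()
# ... the main thread is free to do other work here ...
t.join()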

1 Answer


You cannot write multiple files to the same tarfile at the same time. If you try, the blocks will get intermingled and it will be impossible to extract them.

You could instead start multiple threads, where each thread opens its own tarfile, writes its share of the files to it, and closes it.

I believe you can probably join the resulting tarfiles end-to-end. Normally this would involve reading the tarfiles back in at the end (a tar archive is terminated by two 512-byte zero blocks, so those end-of-archive markers have to be dealt with when joining), but since this is all in memory (and presumably the size is small enough to allow that), this won't be much of an issue.
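
As a sketch of that read-back merge, assuming each worker hands back its finished (uncompressed) archive as a bytes object; `merge_parts` and the `parts` list are illustrative names:

import tarfile
from io import BytesIO

def merge_parts(parts, out_fileobj):
    # re-read each partial tar and copy its members into one final
    # archive, so the whole result is compressed only once
    with tarfile.open(mode='w:bz2', fileobj=out_fileobj) as out:
        for part in parts:
            with tarfile.open(mode='r', fileobj=BytesIO(part)) as src:
                for member in src.getmembers():
                    out.addfile(member, src.extractfile(member))

(Python's tarfile.open also accepts ignore_zeros=True, which lets it read naively concatenated archives past the zero-block markers.)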

If you take this approach, you don't want 5000 individual threads: 5000 threads will make the box stop responding (at least for a while), and compressing 5000 tiny archives separately will give an awful compression ratio. Limit yourself to one thread per processor, and divide the work among them, as sketched below.
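
A sketch of that division of labour, using `multiprocessing` so the packing work is spread across processes regardless of the GIL; `build_part` is a hypothetical helper and the keys are the ones from the question:

import multiprocessing
import tarfile
from io import BytesIO

def build_part(keys):
    # each worker packs its chunk of keys into its own in-memory tar
    buf = BytesIO()
    with tarfile.open(mode='w', fileobj=buf) as tar:
        for k in keys:
            id, mydata, s_index = k.split(':')
            info = tarfile.TarInfo('%s_%s.data' % (id, s_index))
            data = mydata.encode('utf-8')
            info.size = len(data)
            tar.addfile(info, BytesIO(data))
    return buf.getvalue()

if __name__ == '__main__':
    unique_keys = ['1:bigstringhere...:5'] * 5000
    n = multiprocessing.cpu_count()
    chunks = [unique_keys[i::n] for i in range(n)]  # one chunk per core
    pool = multiprocessing.Pool(n)
    parts = pool.map(build_part, chunks)  # feed these to merge_parts above
    pool.close()
    pool.join()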

Also, your code as written will create a tar with 5000 files, all called `1_5.data` and all with the contents `bigstringhere...`. I'm assuming this is just an example. If not, create a tarfile with a single file, close it (to flush it), then duplicate the result 5000 times (e.g. if you then want to write it to disk, just write the entire BytesIO out 5000 times).
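
If the data really is 5000 copies of the same payload, a sketch of that shortcut; the file name and contents come from the question's example, and `out.data` is an illustrative path:

import tarfile
from io import BytesIO

# build an archive containing the single file once
single = BytesIO()
with tarfile.open(mode='w:bz2', fileobj=single) as tar:
    data = 'bigstringhere...'.encode('utf-8')
    info = tarfile.TarInfo('1_5.data')
    info.size = len(data)
    tar.addfile(info, BytesIO(data))

# then write the same compressed blob to disk 5000 times
blob = single.getvalue()
with open('out.data', 'wb') as f:
    for _ in range(5000):
        f.write(blob)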

I believe the most expensive part of this is the compression; you could use the external program `pigz`, which does gzip compression in parallel across all your cores.
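
A sketch of handing the compression to `pigz`, assuming it is installed and on the PATH; the uncompressed tar bytes (`raw`) would come from a builder like the sketches above, and `out.tar.gz` is an illustrative path:

import subprocess
import tarfile
from io import BytesIO

# build the tar uncompressed; pigz will do the (gzip) compression
raw = BytesIO()
with tarfile.open(mode='w', fileobj=raw) as tar:
    data = 'bigstringhere...'.encode('utf-8')
    info = tarfile.TarInfo('1_5.data')
    info.size = len(data)
    tar.addfile(info, BytesIO(data))

# pigz -c compresses stdin on all available cores and writes to stdout
with open('out.tar.gz', 'wb') as f:
    p = subprocess.Popen(['pigz', '-c'], stdin=subprocess.PIPE, stdout=f)
    p.communicate(raw.getvalue())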

AMADANON Inc.