
I'm implementing a torrent downloading and archiving system. I want to download a torrent (which contains several small files) and then archive it. My disk performance is poor, so I want an efficient way of archiving the files.

I have several options:

1. Download the files to a normal disk/filesystem and then create the TAR with the standard Unix tar command.

2. Create a blank TAR archive, mount it in write mode using archivemount, and then start downloading the torrent into the mounted path.

3. Similar to option 2, but using a ZIP file instead of TAR.

4. As I want to deliver the files over a web browser: implement a piece of software/script that TARs a folder on the fly. (I wrote a Python script (uWSGI/Nginx) years ago to do this, but since tar requires a checksum for each file, the performance was pretty poor.)

5. Find a torrent client that can write directly into a TAR/ZIP file. (Very unlikely.)

Which approach should I consider?

Thank you.

Pedram A

1 Answer


The best option for performance is actually still likely to be 4, if disk really is your true bottleneck: it stops you from having to spend precious IOPS copying files from one place to another.

Option 4 is also really the only one that lets the client start downloading the instant your server has finished the torrent, so the client actually gets their data sooner. It also makes it easy to let the user download individual files (dead simple, since they're just sitting there on your filesystem).

I would investigate why tar was giving you such poor performance. I really doubt the checksums were your problem, since as far as I can remember they only cover the headers, not the file data. Is there any reason you can't just pipe the output from GNU tar directly down to the web browser rather than writing your own tar packer?
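As a rough illustration of that suggestion, a minimal WSGI sketch that pipes GNU tar's stdout straight to the client might look like the following. The directory path and file names are placeholders, and no Content-Length is sent (see the next paragraph):

```python
# Minimal sketch: stream "tar -c" output straight to the HTTP client.
# Assumes GNU tar is installed and that /srv/torrents/example is the
# finished download directory (hypothetical path).
import subprocess

def application(environ, start_response):
    directory = "/srv/torrents/example"  # hypothetical download directory

    # Let GNU tar do the packing and write the archive to stdout.
    proc = subprocess.Popen(
        ["tar", "-C", directory, "-cf", "-", "."],
        stdout=subprocess.PIPE,
    )

    start_response("200 OK", [
        ("Content-Type", "application/x-tar"),
        ("Content-Disposition", 'attachment; filename="example.tar"'),
    ])

    def stream():
        try:
            while True:
                chunk = proc.stdout.read(64 * 1024)
                if not chunk:
                    break
                yield chunk
        finally:
            proc.stdout.close()
            proc.wait()

    return stream()
```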

One challenge with this approach would be providing a correct Content-Length to the client. If you don't care about that, you could simply omit the header; the client would then just not see a percentage counter for the download, which may or may not matter for your application.
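If you do want an exact Content-Length, the size of an uncompressed tar stream is predictable from the file list alone. Here is a rough sketch of that calculation, assuming only regular files with names short enough to fit the classic 100-byte header field, and GNU tar's default 10240-byte record padding:

```python
# Sketch of predicting the size of an uncompressed tar archive.
# Assumptions: plain regular files only, member names <= 100 bytes
# (longer names need extra GNU/PAX headers and would break this),
# and GNU tar's default blocking factor of 20 (10240-byte records).
import os

BLOCK = 512
RECORD = 20 * BLOCK  # default GNU tar record size

def tar_size(paths):
    total = 0
    for path in paths:
        size = os.path.getsize(path)
        total += BLOCK                      # one 512-byte header per file
        total += -(-size // BLOCK) * BLOCK  # file data rounded up to 512 bytes
    total += 2 * BLOCK                      # two zero blocks end the archive
    return -(-total // RECORD) * RECORD     # pad up to a whole record
```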

Per von Zweigbergk
  • Performance was poor because of the Content-Length, and also because handling multi-connection downloads meant honouring the HTTP Range header, seeking through the files and starting the output at the correct offset, which was slow. – Pedram A Oct 15 '15 at 09:19
  • Oh, that's right: if you just use GNU tar with an HTTP Range header, you end up having to tar the whole thing from the start and discard the output until you reach the requested offset. You might be able to mitigate that with a custom tar packer, though. Are connections really so unreliable that a lot of resuming is required? A simple way to sidestep this would be to just disallow multi-connection downloads. – Per von Zweigbergk Oct 15 '15 at 09:45
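For what it's worth, a minimal sketch of the "tar from the start and discard" fallback mentioned in the comment above: offset is assumed to have already been parsed from the client's Range header, and proc is the tar subprocess from the earlier sketch.

```python
# Sketch of resuming mid-archive: re-run tar from the beginning and
# throw away bytes until the client's requested offset is reached.
# "offset" would come from parsing the HTTP Range header (not shown).
def stream_from(proc, offset, chunk_size=64 * 1024):
    skipped = 0
    while skipped < offset:
        chunk = proc.stdout.read(min(chunk_size, offset - skipped))
        if not chunk:
            return  # archive shorter than the requested offset
        skipped += len(chunk)
    while True:
        chunk = proc.stdout.read(chunk_size)
        if not chunk:
            break
        yield chunk
```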