
I've recently started working with S3 and have come across the need to upload and compress large files (roughly 10 GB) to S3. The current implementation I'm working with creates a temporary compressed file locally, uploads it to S3, and finally deletes the temporary file. For a 10 GB file, I have almost 20 GB stored locally until the upload is done.

I need a way to transfer the file to S3 and then compress it there. Is this approach viable? If yes, how should I go about it? If not, is there any way I can minimize the local space needed?

I've seen someone suggest that the file could be uploaded to S3, downloaded to an EC2 instance in the same region, compressed there, and then uploaded back to S3 while deleting the first copy. This might work, but making two uploads to get one file up doesn't seem to be an advantage cost-wise.

I've tried to upload a compression stream without success, but I've just discovered that S3 does not support compression streaming, and now I'm clueless as to how to proceed.

I'm using the gzip library in .NET.
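
My current flow looks roughly like this; a simplified sketch, assuming the AWS SDK for .NET's TransferUtility for the upload, with placeholder paths and bucket names:

```csharp
// Simplified sketch of the temp-file approach described above.
// Paths and bucket/key names are placeholders.
using System.IO;
using System.IO.Compression;
using Amazon.S3;
using Amazon.S3.Transfer;

class CompressThenUpload
{
    static void Main()
    {
        const string source = @"C:\data\big_file";
        const string tempGz = @"C:\data\big_file.gz"; // this temp copy is what doubles the local footprint

        // 1. Compress to a local temporary file.
        using (var input = File.OpenRead(source))
        using (var output = File.Create(tempGz))
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            input.CopyTo(gzip);
        }

        // 2. Upload the compressed file (TransferUtility switches to multipart
        //    upload automatically for large files).
        using (var s3 = new AmazonS3Client())
        {
            var transfer = new TransferUtility(s3);
            transfer.Upload(tempGz, "the-bucket", "folder/big_file.gz");
        }

        // 3. Delete the temporary file once the upload has finished.
        File.Delete(tempGz);
    }
}
```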

Benji
VmLino

5 Answers


In the Linux shell, via the AWS CLI, streaming support was added about 3 months after you asked the question :-)

Added the ability to stream data using cp

So the best you can do, I guess, is to pipe the output of gzip to the AWS CLI:

Upload from stdin:

gzip -c big_file | aws s3 cp - s3://bucket/folder/big_file.gz

Download to stdout:

aws s3 cp s3://bucket/folder/big_file.gz - | gunzip -c ...

Ferdinand.kraft

If space is at a premium at the location where you initially have the file, then uploading the file to S3, and subsequently downloading, compressing, and re-uploading it to S3 on an EC2 instance in the same region as the S3 bucket, is actually a very sensible (if seemingly counter-intuitive) suggestion, for one simple reason:

AWS does not charge you for bandwidth between EC2 and S3 within the same region.

This is an ideal job for a spot instance... and a good use case for SQS to tell the spot machine what needs to be done.
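
For example, queuing the work for the spot machine could be as simple as this (the queue URL and message format are hypothetical, just to illustrate the idea):

```csharp
// Hypothetical sketch: the uploader enqueues a message naming the object,
// and a spot worker polls the queue, downloads the object, compresses it,
// and re-uploads it. Queue URL and message body are placeholders.
using Amazon.SQS;
using Amazon.SQS.Model;

class EnqueueCompressionJob
{
    static void Main()
    {
        var sqs = new AmazonSQSClient();
        sqs.SendMessage(new SendMessageRequest
        {
            QueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/compress-jobs",
            // The worker reads bucket/key from the body, fetches the object,
            // pipes it through gzip, and writes big_file.gz back to S3.
            MessageBody = "{\"bucket\":\"the-bucket\",\"key\":\"uploads/big_file\"}"
        });
    }
}
```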

On the other hand... you're spending more of your local bandwidth uploading that file if you don't compress it first.

If you are a programmer, you should be able to craft a utility similar to the one I have written for internal use (this is not a plug; it's not currently available for release) that compresses (via external tools) and uploads files to S3 on-the-fly.

It works something like this pseudocode example command line:

cat input_file | gzip -9c | stream-to-s3 --bucket 'the-bucket' --key 'the/path'

That's a simplified usage example, to illustrate the concept. Of course, my "stream-to-s3" utility accepts a number of other arguments, including x-amz-meta metadata, the aws access key and secret, but you get the idea, perhaps.

Common compression utilities like gzip, pigz, bzip2, pbzip2, xz, and pixz all can read the source file from STDIN and write the compressed data to STDOUT without ever writing the compressed version of the file to disk.

The utility I use reads the file data from its STDIN via the pipeline, and, using S3 Multipart Upload (even for small files that don't technically need it, because S3 Multipart Upload cleverly does not require you to know the size of the file in advance), it just keeps sending data to S3 until it reaches EOF on its input stream. Then it completes the multipart upload and ensures that everything succeeded.

I use this utility to build and upload entire tarballs, with compression, without ever touching a single block of disk space. Again, it was not particularly difficult to write, and could have been done in a number of languages. I didn't even use any S3 SDK, I rolled my own from scratch, using a standard HTTP user agent and the S3 API documentation.
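
In .NET terms, the idea looks roughly like this. This is a simplified sketch, not my actual utility: it leans on the AWS SDK's low-level multipart calls rather than raw HTTP, and argument checking and error handling are left out.

```csharp
// Sketch: read stdin, buffer 5 MB parts, and push them to S3 with the
// AWS SDK for .NET's low-level multipart calls.
// Real code should also call AbortMultipartUpload if anything fails.
using System;
using System.Collections.Generic;
using System.IO;
using Amazon.S3;
using Amazon.S3.Model;

class StreamToS3
{
    const int PartSize = 5 * 1024 * 1024; // minimum size for every part except the last

    static void Main(string[] args)
    {
        string bucket = args[0], key = args[1];   // e.g. the-bucket the/path

        using (var s3 = new AmazonS3Client())
        using (var stdin = Console.OpenStandardInput())
        {
            // Start the multipart upload; no total size has to be declared up front.
            var init = s3.InitiateMultipartUpload(
                new InitiateMultipartUploadRequest { BucketName = bucket, Key = key });

            var etags = new List<PartETag>();
            var buffer = new byte[PartSize];
            int partNumber = 1;

            while (true)
            {
                // Fill one part's worth of data from the pipe (less at EOF).
                int filled = 0, n;
                while (filled < PartSize &&
                       (n = stdin.Read(buffer, filled, PartSize - filled)) > 0)
                    filled += n;
                if (filled == 0) break; // EOF, nothing left to send

                var part = s3.UploadPart(new UploadPartRequest
                {
                    BucketName = bucket,
                    Key = key,
                    UploadId = init.UploadId,
                    PartNumber = partNumber,
                    PartSize = filled,
                    InputStream = new MemoryStream(buffer, 0, filled)
                });
                etags.Add(new PartETag(partNumber++, part.ETag));

                if (filled < PartSize) break; // short read means the pipe is closed
            }

            // Ask S3 to assemble the parts into a single object.
            s3.CompleteMultipartUpload(new CompleteMultipartUploadRequest
            {
                BucketName = bucket,
                Key = key,
                UploadId = init.UploadId,
                PartETags = etags
            });
        }
    }
}
```

Compiled as, say, stream-to-s3.exe, it could sit at the end of a pipeline just like the pseudocode above.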

Michael - sqlbot
  • OP says "S3 does not support compression streaming." I'm not sure what that really means, but I do know my answer is not theoretical. I stream dozens of GB of heavily compressed data to S3 on the fly on a daily basis. S3 supports what is effectively "streaming" via multipart uploads and is agnostic to the compressedness of what's being uploaded. – Michael - sqlbot Jun 06 '14 at 00:19
  • What I meant is that I couldn't compress it while uploading. A file that is already compressed uploads just fine. So, if I read it right, your code actually manages to read a file stream, compress it, and upload it with multipart (I'm familiar with this function) without ever using a temp file? – VmLino Jun 06 '14 at 02:54
  • I see. And that's what I'm doing, uploading a file that is compressed, but I'm doing it as the compression algorithm feeds me compressed data on the pipe. – Michael - sqlbot Jun 06 '14 at 03:00
  • I tried to compress and feed multipart with a stream containing the compressed data. The thing is, I did not manage to keep the flow going and ended up with 3 corrupted .gz files, because multipart closed every part as a file. Maybe I messed up the code at some point. And after my boss told me that you can't compress the file piece by piece and then put it all together, I figured I might come here for help. It's curious to see something so similar to my first approach here. – VmLino Jun 06 '14 at 03:40
  • One [multipart upload](http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html) would only ever possibly create one file (object) in S3, so if you somehow ended up with 3 files, that would be a flaw in your implementation, rather than in the general principle of what you were trying. – Michael - sqlbot Jun 06 '14 at 11:07
  • So, if my multipart function is in an external method and I call it in a loop that feeds it a stream with compressed parts of my original file, for it to work I'd have to feed it the response ID as a parameter on every call? That was going to be my next try before I was told it would not work. – VmLino Jun 06 '14 at 13:59
  • Yes, you'd need to feed it the upload id and the part number that it is supposed to send... and collect, from it, the ETag returned by the put-part call. You have to combine all of those etags together with the upload id when you call complete-multipart-upload. Each part you upload (except the last) must be at least 5 MB in size. If you split the compressed output on boundaries of, say, 5MB, and upload each 5MB chunk, S3 reassembles the parts into exactly the original data. – Michael - sqlbot Jun 06 '14 at 21:15

I need a way to transfer the file to S3 and then compress it there. Is this approach viable?

This approach is not viable/optimal. Compression takes a lot of CPU resources, and Amazon S3 is in the business of storing data, not performing heavy-duty processing of your files.

With S3 you are also paying bandwidth for what you upload, so you are wasting money sending more data than need be.

I've seen someone suggest that the file could be uploaded to S3, downloaded to an EC2 instance in the same region, compressed there, and then uploaded back to S3 while deleting the first copy.

What you could do is upload directly to EC2, compress there, and then upload to S3 from there. But now you've moved your 20 GB problem from your local machine to the EC2 instance.

The best approach is to continue using your current approach of compressing locally and then uploading.

Martin Konecny

One very important S3 feature for upload throughput is parallel upload. Several tools do this, such as the AWS CLI, s3cmd, or CrossFTP. From the .NET API, the same can be achieved using the TransferUtility class.
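
For example, a parallel multipart upload from .NET could look roughly like this (part size and concurrency values are only illustrative):

```csharp
// Sketch: parallel multipart upload with TransferUtility.
// Part size and concurrency are example values, not recommendations.
using Amazon.S3;
using Amazon.S3.Transfer;

class ParallelUpload
{
    static void Main()
    {
        var config = new TransferUtilityConfig
        {
            ConcurrentServiceRequests = 10   // how many parts are in flight at once
        };

        using (var s3 = new AmazonS3Client())
        using (var transfer = new TransferUtility(s3, config))
        {
            transfer.Upload(new TransferUtilityUploadRequest
            {
                FilePath = @"C:\data\big_file.gz",
                BucketName = "the-bucket",
                Key = "folder/big_file.gz",
                PartSize = 16 * 1024 * 1024  // 16 MB parts
            });
        }
    }
}
```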

If you truly need compression, take a look at S3DistCp, a tool that can run transfers across several machines in parallel and compress on the fly.

Julio Faerman

If you are using .NET, you could use a character stream, but you would still need local storage greater than 20 GB.

Additionally, to be the bearer of bad news, Amazon S3 is just storage. You may need to spin up another AWS service that can run a program to do the compression against that storage: your app uploads the file, and the compression then runs over the S3 storage.

If your project is smaller, you may want to consider an IaaS provider rather than PaaS. That way, storage and app can be on the same set of servers.

Kyle_at_NU