How to gzip while uploading into s3 using boto

I have a large local file. I want to upload a gzipped version of that file into S3 using the boto library. The file is too large to gzip efficiently on disk prior to uploading, so it should be gzipped in a streaming fashion during the upload.

The boto library provides a method set_contents_from_file() which expects a file-like object that it will read from.

The gzip library provides the class GzipFile, which accepts a file-like object via the parameter named fileobj; it will write to this object when compressing.

I'd like to combine these two mechanisms, but one API wants to read on its own and the other API wants to write on its own; neither offers a passive counterpart (such as being written to or being read from).
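
For context, here is a minimal sketch of how each API is used on its own (the bucket, key, and file names are placeholders): boto pulls data out of a file-like object, while GzipFile pushes data into one, so neither side can simply be handed to the other.

import gzip
import boto

# boto wants to *read* from a file-like object:
conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')          # placeholder bucket name
key = bucket.new_key('some/key')
with open('input.txt', 'rb') as plain_file:
    key.set_contents_from_file(plain_file)     # boto reads the data itself

# gzip wants to *write* to a file-like object:
with open('output.gz', 'wb') as target:
    compressor = gzip.GzipFile(fileobj=target, mode='wb')
    compressor.write(b'some data')             # GzipFile writes the compressed data itself
    compressor.close()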

Does anybody have an idea on how to combine these in a working fashion?

EDIT: I accepted one answer (see below) because it pointed me in the right direction, but if you have the same problem, you might find my own answer (also below) more helpful, because it implements a solution using multipart uploads.

– Alfe

3 Answers


I implemented the solution hinted at in the comments of the accepted answer by garnaat:

import cStringIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = cStringIO.StringIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        # the mutable default argument serves as a persistent part counter across calls
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with open(fileName, 'rb') as inputFile:  # read the input in binary mode
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # every part except the last must be at least 5 MB (5242880 bytes)
                uploadPart()

It seems to work without problems. And after all, streaming is in most cases just chunking of the data. In this case, the chunks are about 10 MB large, but who cares? As long as we aren't talking about several-GB chunks, I'm fine with this.
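
For illustration, a call might look roughly like this (the bucket name and file paths are placeholders; boto 2 credentials are assumed to be configured in the environment):

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-log-bucket')                 # placeholder bucket name
sendFileGz(bucket, 'logs/huge.log', '/var/log/huge.log')
# the compressed object ends up in S3 under the key 'logs/huge.log.gz'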


Update for Python 3:

from io import BytesIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        # the mutable default argument serves as a persistent part counter across calls
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with open(fileName, "rb") as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # every part except the last must be at least 5 MB (5242880 bytes)
                uploadPart()
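
For readers who are on boto3 rather than boto 2 (a question raised in the comments below), a rough, untested adaptation of the same idea using the low-level client's multipart calls might look like this; the names send_file_gz_boto3, bucket_name, etc. are placeholders of my own choosing:

import gzip
from io import BytesIO

import boto3

def send_file_gz_boto3(s3_client, bucket_name, key, file_name, suffix='.gz'):
    key += suffix
    mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=key)
    upload_id = mpu['UploadId']
    parts = []                 # boto3 needs the ETag of every part to complete the upload
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='wb')

    def upload_part():
        part_number = len(parts) + 1
        stream.seek(0)
        response = s3_client.upload_part(
            Bucket=bucket_name, Key=key, UploadId=upload_id,
            PartNumber=part_number, Body=stream.read())
        parts.append({'ETag': response['ETag'], 'PartNumber': part_number})
        stream.seek(0)
        stream.truncate()

    try:
        with open(file_name, 'rb') as input_file:
            while True:  # until EOF
                chunk = input_file.read(8192)
                if not chunk:  # EOF
                    compressor.close()
                    upload_part()
                    s3_client.complete_multipart_upload(
                        Bucket=bucket_name, Key=key, UploadId=upload_id,
                        MultipartUpload={'Parts': parts})
                    break
                compressor.write(chunk)
                if stream.tell() > 10 << 20:  # every part except the last must be at least 5 MB
                    upload_part()
    except Exception:
        # clean up the incomplete upload so the parts are not kept around on S3
        s3_client.abort_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload_id)
        raise

A call could then look like send_file_gz_boto3(boto3.client('s3'), 'my-bucket', 'logs/huge.log', '/var/log/huge.log'), again with placeholder bucket, key, and file names.
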
– Alfe
  • How is the mpu defined? `s3.Bucket('').Object('')` How is this different from `boto3.client('s3')` where we use `s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)`? – user 923227 Mar 01 '19 at 03:39
  • `session = boto3.session.Session();s3 = session.resource('s3');bucket = s3.Bucket(bucket_name);mpu = bucket.initiate_multipart_upload(key);` – Eugene Ramirez Mar 01 '19 at 04:40
  • I changed `stream.tell() > 10<<20` to `10<<19` and the gz size jumped from 15 MB to 62 MB. – user 923227 Mar 01 '19 at 22:21
  • Interesting effect. But I see no connection to the change. Let us hear after your investigation what the reason for this was. 10<<19 should also be okay according to what the documentation says (because that's exactly the lower limit of 5242880 bytes for a multipart upload). – Alfe Mar 04 '19 at 10:01
  • Hi Alfe, I saw that once the first chunk is out and the stream size drops, the stream size starts increasing very fast. For me the max zip size will be less than 5 MB, so I did not investigate further. – user 923227 Mar 04 '19 at 19:40
  • create_multipart_upload() only accepts keyword arguments. – syberkitten Aug 07 '20 at 09:14
  • The final uploaded file does not seem readable after decompressing. Anything missing in this piece of code? – Meet Shah Aug 02 '23 at 22:14
  • @MeetShah When I programmed this over ten years ago, it worked. I tested it thoroughly, so the code *wasn’t* missing anything back then. But of course things might have changed in the meantime. Maybe try to adopt the idea and create your own solution. You also could update the answer then! (I won’t do this anymore, I’ve moved on to other topics since then, so good luck!) – Alfe Aug 08 '23 at 17:47

You can also easily compress bytes with gzip and upload them as follows:

import gzip
import boto3

cred = boto3.Session().get_credentials()

s3client = boto3.client('s3',
                            aws_access_key_id=cred.access_key,
                            aws_secret_access_key=cred.secret_key,
                            aws_session_token=cred.token
                            )

bucketname = 'my-bucket-name'      
key = 'filename.gz'  

s_in = b"Lots of content here"
gzip_object = gzip.compress(s_in)

s3client.put_object(Bucket=bucketname, Body=gzip_object, Key=key)

It is possible to replace s_in with any bytes object, e.g. the contents of an io.BytesIO buffer, a pickle dump, or a file read into memory.

If you want to upload compressed JSON, then here is a nice example: Upload compressed Json to S3
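
As a rough sketch of that idea (the dictionary contents, bucket, and key names are placeholders):

import gzip
import json

import boto3

s3client = boto3.client('s3')

payload = {'events': [{'id': 1, 'message': 'hello'}]}    # placeholder data
gzipped_json = gzip.compress(json.dumps(payload).encode('utf-8'))

s3client.put_object(
    Bucket='my-bucket-name',
    Key='payload.json.gz',
    Body=gzipped_json,
    ContentType='application/json',
    ContentEncoding='gzip',    # marks the body as gzip-compressed JSON
)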

– Rene B.
  • Looks like this tries to work with the whole contents in memory, right? Consider a 10GB log file I want to upload. Would this be feasible using your approach? – Alfe Sep 04 '19 at 10:27
  • @Alfe true, with this approach the file has to fit in memory. However, it's an easier solution to the title of your question "How to gzip while uploading into s3 using boto". – Rene B. Sep 04 '19 at 11:07
  • No, because strictly speaking you do not gzip _while_ uploading but prior to it (in memory). In my use case I had a very large file (10 GB or similar) and wanted to store a gzipped version of it in S3. The only straightforward way of doing that would have been to gzip the file _before_ uploading it, but that would have meant providing the additional storage or runtime memory; also, doing compression while uploading seems feasible as it does two things in parallel. My question aimed at exactly this. – Alfe Sep 04 '19 at 14:21
  • This should be the newly accepted answer. The answer by @Alfe no longer works out of the box, at least not when I tried it. Multiple issues. – Joshua Wolff Aug 10 '20 at 20:20
  • @JoshWolff It can't be, because it doesn't answer the question, which contains the following aspect: »The file is too large to gzip it efficiently on disk prior to uploading.« This answer here answers a different question (without the mentioned restriction). But thanks for pointing out that there are issues with my former solution. I'm not using it anymore, so I didn't know. You should maybe report your findings in comments on the other answers so that other people can benefit from your work. – Alfe Aug 11 '20 at 00:11
  • @Alfe Ah, I see. I normally do add my findings but didn't have time to debug it. Just went with the solution that worked, and so I added my finding here. – Joshua Wolff Aug 11 '20 at 03:25
  • @JoshWolff How likely is it that people will read your comment here when they are trying to use the solution I proposed in a different answer? ;-) It would already be helpful to give some keywords about what no longer worked and how you approached it (e.g. »Python 3 issues«, »newer boto needs keyword arguments«, etc.), and obviously that you didn't get it running anymore, so people are warned about it. – Alfe Aug 11 '20 at 08:39

There really isn't a way to do this because S3 doesn't support true streaming input (i.e. chunked transfer encoding). You must know the Content-Length prior to upload and the only way to know that is to have performed the gzip operation first.

– garnaat
  • Will the S3 upload really need to know the size of the value? That would indeed mean that no streaming compression could be performed while storing. I'm going to check on this. – Alfe Apr 02 '13 at 15:09
  • There is a `set_contents_from_stream()` on the boto S3 bucket keys. That at least hints that streaming should be possible, don't you think? – Alfe Apr 02 '13 at 15:15
  • From its documentation: `The stream object is not seekable and total size is not known. This has the implication that we can't specify the Content-Size and Content-MD5 in the header. So for huge uploads, the delay in calculating MD5 is avoided but with a penalty of inability to verify the integrity of the uploaded data.` – Alfe Apr 02 '13 at 15:16
  • The `set_contents_from_stream` method is supported only on Google Cloud Storage, not S3. – garnaat Apr 02 '13 at 15:26
  • Strange. I'm looking at `boto.s3.key.set_contents_from_stream()`. I'm going to try this out, just to be sure. Whatever the outcome, thanks for your consideration already! :) – Alfe Apr 02 '13 at 17:34
  • `boto.exception.BotoClientError: BotoClientError: s3 does not support chunked transfer`. Seems you're right. But there is a multipart upload for S3. I'm going to have a look at this again. – Alfe Apr 02 '13 at 17:37
  • Yes, S3 supports multipart upload. But still, each part must be known before uploading. There is no support for streaming upload in S3. Breaking your huge file up into parts and using multipart sounds like a reasonable approach. – garnaat Apr 02 '13 at 18:56
  • Going to try that (i.e. gzipping in a streaming fashion, chunking the outcome, and multipart-uploading the chunks). Thanks for the information. In case no better solution pops up, I will accept your answer, as it contains an important aspect of S3 uploading. But for now I don't want to do this, so that people don't find this question less interesting just because it already has an accepted answer. – Alfe Apr 02 '13 at 20:16