import contextlib
import gzip

import s3fs

AWS_S3 = s3fs.S3FileSystem(anon=False)  # AWS env must be set up correctly

source_file_path = "/tmp/your_file.txt"
s3_file_path = "my-bucket/your_file.txt.gz"

with contextlib.ExitStack() as stack:
    source_file = stack.enter_context(open(source_file_path, mode="rb"))
    destination_file = stack.enter_context(AWS_S3.open(s3_file_path, mode="wb"))
    # explicit mode="wb" so GzipFile does not fall back to read mode
    destination_file_gz = stack.enter_context(gzip.GzipFile(fileobj=destination_file, mode="wb"))
    while True:
        chunk = source_file.read(1024)
        if not chunk:
            break
        destination_file_gz.write(chunk)

I was trying to run something like this on an AWS Lambda function, but it throws an error because it is unable to install the s3fs module. Also, I am using boto for the remaining parts of my code, so I would like to stick to boto. How can I use boto for this too?

Basically, I am opening/reading a file from a /tmp path, gzipping it, and then saving it to an S3 bucket.

Edit:

import contextlib
import gzip
import io

import boto3

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('testunzipping')
s3_filename = 'samplefile.csv.'

for i in testList:
    # zip_ref.open(i, 'r')
    with contextlib.ExitStack() as stack:
        source_file = stack.enter_context(open(i, mode="rb"))
        destination_file = io.BytesIO()
        destination_file_gz = stack.enter_context(gzip.GzipFile(fileobj=destination_file, mode='wb'))
        while True:
            chunk = source_file.read(1024)
            if not chunk:
                break
            destination_file_gz.write(chunk)
        destination_file_gz.close()  # close first so the gzip trailer is written
        destination_file.seek(0)

        fileName = i.replace("/tmp/DataPump_10000838/", "")
        bucket.upload_fileobj(destination_file, fileName)

Each item in testList looks like this: "/tmp/your_file.txt"


1 Answer


AWS Lambda function but it throws an error because it is unable to install the s3fs module

Additional packages and your own library code (reusable code) should be put in Lambda layers.
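
A typical packaging flow looks like the sketch below (the layer name, target directory, and runtime version are illustrative, not from the question). Dependencies must sit under a python/ directory inside the layer zip so Lambda puts them on the import path.

# bundle s3fs into a layer zip, then publish it; attach the layer to the function afterwards
pip install s3fs -t layer/python
cd layer
zip -r ../s3fs-layer.zip python
aws lambda publish-layer-version --layer-name s3fs \
    --zip-file fileb://../s3fs-layer.zip \
    --compatible-runtimes python3.9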

How can I use boto for this too?

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket(bucket_name)

Then either:

If you have your file in memory (a file-like object opened in bytes mode, e.g. io.BytesIO, or a file opened with open(..., 'rb')):

bucket.upload_fileobj(fileobj, s3_filename)

Or if you have a file on disk:

bucket.upload_file(filepath, s3_filename)

https://boto3.amazonaws.com/v1/documentation/api/1.18.53/reference/services/s3.html#S3.Bucket.upload_file
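
Putting it together for the question's use case, here is a minimal sketch (the bucket name, paths, and key below are placeholders): gzip the local file into an in-memory buffer, close the GzipFile so the gzip trailer gets written, rewind the buffer, then upload it.

import gzip
import io

import boto3

bucket = boto3.resource("s3").Bucket("my-bucket")  # placeholder bucket name

with open("/tmp/your_file.txt", "rb") as source_file:
    buffer = io.BytesIO()
    # mode="wb" is needed because GzipFile cannot infer a mode from BytesIO
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        while True:
            chunk = source_file.read(1024)
            if not chunk:
                break
            gz.write(chunk)
    # the inner with-block closes the GzipFile and flushes the gzip trailer;
    # rewind before uploading so the whole buffer is read
    buffer.seek(0)
    bucket.upload_fileobj(buffer, "your_file.txt.gz")

Alternatively, the compressed data could be written to a temporary file and sent with upload_file; on Lambda, /tmp is the only writable path.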

  • I know about upload_file but how do I open the file and then convert to gzip in the first place? How can I open it and gzip it? Don't know how to open it in current space or in bytes mode? – x89 Oct 20 '21 at 08:45
  • Oh. You can use io.BytesIO as destination "file" in your code (then remember to go to the beginning of it before uploading fileobj, using `.seek(0)`). Or use https://docs.python.org/3/library/gzip.html#gzip.compress – h4z3 Oct 20 '21 at 08:48
  • 1. You changed too much from your original code, you don't have gzipping in there. 2. For your error: BytesIO() <- parens to make it an object. 3. `destination_file.seek(0)` after gzipping but before uploading (you have to "rewind" the file-like object to the beginning so the whole content can be seen) – h4z3 Oct 20 '21 at 09:02
  • You had a while True loop previously, when you read and wrote chunks. You don't have it now, so it's empty. ;) – h4z3 Oct 20 '21 at 09:23
  • As I said, seek should be after gzipping (writing, your loop) and before uploading. Right now it's before gzipping. – h4z3 Oct 20 '21 at 13:52
  • Ah I see I was missing the chunk writing part. Anyhow, the current code gives me ```[Errno 9] write() on read-only GzipFile object",``` – x89 Oct 20 '21 at 14:09
  • Ah. Yes. You didn't give a write mode to the file. `gzip.GzipFile(fileobj=destination_file, mode='wb')` - https://docs.python.org/3/library/gzip.html#gzip.GzipFile – h4z3 Oct 20 '21 at 14:18
  • ```"errorMessage": "Negative seek in write mode",```:( – x89 Oct 20 '21 at 14:22
  • Really not sure about the order of commands you're suggesting...maybe you can take a look at it when feeling fresh again XD – x89 Oct 20 '21 at 14:38
  • What I mean: `destination_file_gz.seek(0)` -> `destination_file.seek(0)` and `bucket.upload_fileobj(destination_file_gz, fileName)` -> `bucket.upload_fileobj(destination_file, fileName)` | Because _gz one is only in write mode. And it's not our original file-like object we wanted. – h4z3 Oct 20 '21 at 14:41
  • ```I/O operation on closed file.```:( – x89 Oct 20 '21 at 14:56
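
For completeness, here is a minimal sketch of the gzip.compress alternative h4z3 mentions above, which sidesteps the seek/close ordering problems entirely (it assumes testList and the bucket from the question, and note that it reads each file fully into memory):

import gzip
import io

import boto3

bucket = boto3.resource("s3").Bucket("testunzipping")  # bucket name from the question

for path in testList:
    with open(path, "rb") as source_file:
        compressed = gzip.compress(source_file.read())  # one-shot compression
    fileName = path.replace("/tmp/DataPump_10000838/", "")
    bucket.upload_fileobj(io.BytesIO(compressed), fileName)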