
I'm uploading files to my S3 bucket using a pre-signed URL. The upload itself works perfectly, but the files are very large and I need to show a progress bar. I have tried many solutions from Stack Overflow and other blog posts, but nothing seems to help.

Following is the code snippet that uploads the data to S3 using a pre-signed URL:

import requests

# create_presigned_post is a helper defined elsewhere (e.g. the one from the boto3 docs)
object_name = 'DataSet.csv'
response = create_presigned_post("mybucket_name", object_name)

fields = response['fields']
with open(object_name, 'rb') as f:
    files = {'file': (object_name, f)}
    http_response = requests.post(response['url'], data=fields, files=files, stream=True)

print(http_response.status_code)

It returns status 204, which indicates a successful upload.

What changes can I make to this code to show a progress bar?

P.S. I have tried stream=True in requests; it does not help. I have also tried iterating over the response with tqdm, but that does not work either.

Mohsin Ashraf

2 Answers


I don't think there is a way to do this when uploading a large file through a pre-signed URL with a single HTTP POST request. You can achieve it with the AWS S3 multipart upload mechanism: the file is uploaded in parts, so you know when each part completes and can calculate the progress from that. I wrote a post with code snippets for working with multipart uploads and pre-signed URLs (TypeScript): https://www.altostra.com/blog/multipart-uploads-with-s3-presigned-url
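
In Python terms, the idea looks roughly like this: upload each part through its own pre-signed upload_part URL and report progress as parts complete. A minimal sketch, assuming the pre-signed URLs have already been generated; the helper name upload_parts and the part_size default are mine, not from the linked post.

import requests

def upload_parts(presigned_urls, file_path, part_size=5 * 1024 * 1024):
    """Upload each part via its pre-signed URL and print overall progress."""
    parts = []
    total = len(presigned_urls)
    with open(file_path, 'rb') as f:
        for part_no, url in enumerate(presigned_urls, start=1):
            res = requests.put(url, data=f.read(part_size))  # one PUT per part
            res.raise_for_status()
            parts.append({'ETag': res.headers['ETag'], 'PartNumber': part_no})
            print(f"progress: {part_no / total:.0%}")  # one update per completed part
    return parts  # feed this to complete_multipart_upload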

Shahar Yakov
  • You don't seem to be able to apply conditions to this like you can with the presigned post request so anyone with access to these endpoints could upload 5 TB objects to your storage :(( Is the multipart upload overkill for file uploads capped at 50 MB? – nrmad Feb 04 '21 at 17:07

The following Python code works fine for this; I found it here.

import logging
import argparse
import math
from pathlib import Path

from boto3 import Session
import requests


logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)


class S3MultipartUploadUtil:
    """
    AWS S3 Multipart Upload Uril
    """
    def __init__(self, session: Session):
        self.session = session
        self.s3 = session.client('s3')
        self.upload_id = None
        self.bucket_name = None
        self.key = None

    def start(self, bucket_name: str, key: str):
        """
        Start Multipart Upload
        :param bucket_name:
        :param key:
        :return:
        """
        self.bucket_name = bucket_name
        self.key = key
        res = self.s3.create_multipart_upload(Bucket=bucket_name, Key=key)
        self.upload_id = res['UploadId']
        logger.debug(f"Start multipart upload '{self.upload_id}'")

    def create_presigned_url(self, part_no: int, expire: int=3600) -> str:
        """
        Create pre-signed URL for upload part.
        :param part_no:
        :param expire:
        :return:
        """
        signed_url = self.s3.generate_presigned_url(
            ClientMethod='upload_part',
            Params={'Bucket': self.bucket_name,
                    'Key': self.key,
                    'UploadId': self.upload_id,
                    'PartNumber': part_no},
            ExpiresIn=expire)
        logger.debug(f"Create presigned url for upload part '{signed_url}'")
        return signed_url

    def complete(self, parts):
        """
        Complete Multipart Uploading.
        `parts` is list of dictionary below.
        ```
        [ {'ETag': etag, 'PartNumber': 1}, {'ETag': etag, 'PartNumber': 2}, ... ]
        ```
        you can get `ETag` from upload part response header.
        :param parts: Sent part info.
        :return:
        """
        res = self.s3.complete_multipart_upload(
            Bucket=self.bucket_name,
            Key=self.key,
            MultipartUpload={
                'Parts': parts
            },
            UploadId=self.upload_id
        )
        logger.debug(f"Complete multipart upload '{self.upload_id}'")
        logger.debug(res)
        self.upload_id = None
        self.bucket_name = None
        self.key = None


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('target_file')
    parser.add_argument('--bucket', required=True)
    args = parser.parse_args()

    target_file = Path(args.target_file)
    bucket_name = args.bucket
    key = target_file.name
    max_size = 5 * 1024 * 1024  # 5 MiB: the minimum S3 part size (except for the last part)

    file_size = target_file.stat().st_size
    upload_by = math.ceil(file_size / max_size)  # number of parts needed

    session = Session()
    s3util = S3MultipartUploadUtil(session)

    s3util.start(bucket_name, key)
    urls = []
    for part in range(1, upload_by + 1):
        signed_url = s3util.create_presigned_url(part)
        urls.append(signed_url)

    parts = []
    with target_file.open('rb') as fin:
        for num, url in enumerate(urls):
            part = num + 1
            file_data = fin.read(max_size)
            print(f"upload part {part} size={len(file_data)}")
            res = requests.put(url, data=file_data)
            print(res)
            if res.status_code != 200:
                # a failed part leaves the multipart upload open;
                # ideally call abort_multipart_upload here to clean up
                return
            etag = res.headers['ETag']
            parts.append({'ETag': etag, 'PartNumber': part})

    print(parts)
    s3util.complete(parts)


if __name__ == '__main__':
    main()
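
To get the progress bar the question asks for, the upload loop above can be wrapped with tqdm. A minimal sketch, assuming tqdm is installed; the helper name upload_parts_with_progress is mine, not part of the original code.

import requests
from tqdm import tqdm

def upload_parts_with_progress(target_file, urls, max_size, file_size):
    """Same loop as in main(), but with a byte-based tqdm progress bar."""
    parts = []
    with target_file.open('rb') as fin, tqdm(total=file_size, unit='B', unit_scale=True) as bar:
        for num, url in enumerate(urls):
            part = num + 1
            file_data = fin.read(max_size)
            res = requests.put(url, data=file_data)
            if res.status_code != 200:
                raise RuntimeError(f"part {part} failed with status {res.status_code}")
            parts.append({'ETag': res.headers['ETag'], 'PartNumber': part})
            bar.update(len(file_data))  # advance the bar by the bytes just uploaded
    return parts

Calling parts = upload_parts_with_progress(target_file, urls, max_size, file_size) in main() in place of the plain loop gives a live byte-level progress bar while leaving the rest of the flow unchanged.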
Mohsin Ashraf