
I know this is a little bit open ended, but I am confused as to what strategy/method to apply for a large file upload service developed using Flask and boto3. For smaller files it works fine, but it would be really nice to hear what you think once the size exceeds 100 MB.

What I have in mind is the following -

a) Stream the file to the Flask app using some kind of AJAX uploader. (What I am trying to build is just a REST interface using Flask-RESTful; any examples of using these components together, i.e. Flask-RESTful, boto3 and streaming of large files, are welcome.) The upload app is going to be (I believe) part of a microservices platform that we are building. I do not know whether there will be an Nginx proxy in front of the Flask app or whether it will be served directly from a Kubernetes pod/service. In case it is served directly, is there something I have to change for large file uploads in the Kubernetes and/or Flask layer?

b) Use a direct JS uploader (like http://www.plupload.com/) to stream the file into the S3 bucket directly, and when it finishes get the URL, pass it to the Flask API app and store it in the DB. The problem with this is that the credentials need to be somewhere in the JS, which is a security threat. (Not sure if there are any other concerns.)

Which of these (or something different I did not think about at all) do you think is the best way, and where can I find some code examples for it?

Thanks in advance.

[EDIT]

I have found this - http://blog.pelicandd.com/article/80/streaming-input-and-output-in-flask - where the author is dealing with a situation similar to mine and proposes a solution, but he is opening a file that is already on disk. What if I want to upload the file directly, as it comes in, as one single object in an S3 bucket? I feel that this can be the base of a solution, but not the solution itself.
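To make option (a) concrete, here is a rough, untested sketch of what I have in mind: a Flask-RESTful resource that hands the incoming request stream to boto3's managed transfer (upload_fileobj), which takes care of the multipart upload for large bodies. The bucket name, route and chunk sizes are just placeholders, not a working service.

import boto3
from boto3.s3.transfer import TransferConfig
from flask import Flask, request
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

s3_client = boto3.client('s3')   # credentials come from the environment / IAM role
BUCKET = 'my-upload-bucket'      # placeholder bucket name

# Switch to multipart above 8 MB and upload in 8 MB parts.
transfer_config = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                                 multipart_chunksize=8 * 1024 * 1024)


class Upload(Resource):
    def post(self, filename):
        # Assumes the client sends the raw file bytes as the request body
        # (not multipart/form-data). request.stream is a file-like object,
        # so boto3 reads it in chunks instead of loading everything into memory.
        s3_client.upload_fileobj(request.stream, BUCKET, filename,
                                 Config=transfer_config)
        return {'key': filename}, 201


api.add_resource(Upload, '/upload/<string:filename>')

if __name__ == '__main__':
    app.run()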

SRC

4 Answers


Alternatively, you can use the Minio-py client library. It is open source and compatible with the S3 API, and it handles multipart uploads for you natively.

A simple put_object.py example:

import os

from minio import Minio
from minio.error import ResponseError

client = Minio('s3.amazonaws.com',
               access_key='YOUR-ACCESSKEYID',
               secret_key='YOUR-SECRETACCESSKEY')

# Put a file with default content-type.
try:
    file_stat = os.stat('my-testfile')
    file_data = open('my-testfile', 'rb')
    client.put_object('my-bucketname', 'my-objectname', file_data, file_stat.st_size)
except ResponseError as err:
    print(err)

# Put a file with 'application/csv'
try:
    file_stat = os.stat('my-testfile.csv')
    file_data = open('my-testfile.csv', 'rb')
    client.put_object('my-bucketname', 'my-objectname', file_data,
                      file_stat.st_size, content_type='application/csv')
except ResponseError as err:
    print(err)

You can find the complete list of API operations with examples here.

Installing the Minio-Py library

$ pip install minio

Hope it helps.

Disclaimer: I work for Minio

koolhead17
  1. As far as I know, Flask can only keep the whole HTTP request body in memory; there is no built-in feature such as disk buffering.
  2. The Nginx upload module is a really good way to handle large file uploads; the documentation is here.
  3. You can also use HTML5 or Flash to send chunked file data and process the data in Flask, but it's complicated.
  4. Try to look up whether S3 offers a one-time token (a pre-signed URL, for example); see the sketch after this list.
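
Following up on point 4: S3 pre-signed URLs behave like a short-lived token (they expire, though they are not strictly single-use). A minimal sketch, assuming a hypothetical /presign endpoint and a placeholder bucket name: the Flask app signs the request server-side, and the browser then PUTs the file straight to S3 without ever seeing the AWS credentials.

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3_client = boto3.client('s3')   # credentials stay on the server
BUCKET = 'my-upload-bucket'      # placeholder bucket name

@app.route('/presign', methods=['POST'])
def presign():
    # The client asks for permission to upload a given object key,
    # e.g. by POSTing {"filename": "movie.mp4"} as JSON.
    key = request.json['filename']

    # Generate a URL that allows a single operation (PUT on this key)
    # and expires after 15 minutes.
    url = s3_client.generate_presigned_url(
        'put_object',
        Params={'Bucket': BUCKET, 'Key': key},
        ExpiresIn=900,
    )
    return jsonify({'upload_url': url})

The JS uploader then performs an HTTP PUT of the file body to upload_url; only the time-limited signature travels to the browser, not the access keys.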
Yuan Wang

Using the link I posted above, I finally ended up doing the following. Please tell me if you think it is a good solution.

import boto3
from flask import Flask, request

.
.
.

@app.route('/upload', methods=['POST'])
def upload():
    s3 = boto3.resource('s3', aws_access_key_id='key',
                        aws_secret_access_key='secret',
                        region_name='us-east-1')
    # CHUNK_SIZE is assumed to be defined in the elided code above.
    s3.Object('bucket-name', 'filename').put(Body=request.stream.read(CHUNK_SIZE))
.
.
.
SRC

So I found an option here to actually upload in pieces using boto3.

Here is an example of the below functions using Flask. (It's an untested example made to explain how it works, not for production or anything.)

# Keeps track of in-progress uploads: {filename: {'parts': [...], 'id': upload_id}}
my_save_files = {}

@app.route('/upload/stream', methods=['GET', 'POST'])
def upload_stream():
    # The client sends the part index ('i') and the total number of parts ('len')
    # as request headers with every chunk.
    if 'i' not in request.headers or 'len' not in request.headers:
        return 'fail'
    for fn in request.files:
        index = int(request.headers['i'])
        length = int(request.headers['len'])
        if fn == '':
            return 'fail'
        if fn not in my_save_files:
            # First chunk for this file: start a multipart upload.
            my_save_files[fn] = {'parts': [], 'id': s3.create_multipart_upload(fn)}
        file = request.files[fn]
        s3.multi_upload_part(
            fn,
            my_save_files[fn]['id'],
            my_save_files[fn]['parts'],
            file.read(),
            index + 1  # part numbers start at 1
        )
        if index == length - 1:
            # Last chunk received: ask S3 to assemble the parts.
            s3.complete_multi_part_upload(fn, my_save_files[fn]['id'], my_save_files[fn]['parts'])
        return 'success'
    return 'fail'

Here is the sample code; it uses boto3 to handle S3's multipart upload:


import boto3

r3 = boto3.resource('s3')
c3 = boto3.client('s3')

bucket_name = 'my-bucket-name'  # set this to your bucket

def create_multipart_upload(key):
    # Tell S3 that a multipart upload is starting; S3 returns an UploadId
    # that every subsequent part must reference.
    multipart_upload = c3.create_multipart_upload(
        # ACL='public-read',
        Bucket=bucket_name,  # 'bucket_name',
        # ContentType='video/mp4',
        Key=key,  # 'movie.mp4',
    )
    return multipart_upload['UploadId']

def multi_upload_part(key, upload_id, parts, piece, part_number):
    # Upload one part and remember its ETag; S3 needs the ETag of every
    # part in order to complete the upload.
    uploadPart = r3.MultipartUploadPart(
        bucket_name, key, upload_id, part_number
    )
    uploadPartResponse = uploadPart.upload(
        Body=piece,
    )
    parts.append({
        'PartNumber': part_number,
        'ETag': uploadPartResponse['ETag']
    })

def complete_multi_part_upload(key, upload_id, parts):
    # Ask S3 to stitch all uploaded parts together into the final object.
    completeResult = c3.complete_multipart_upload(
        Bucket=bucket_name,  # 'multipart-using-boto',
        Key=key,
        MultipartUpload={
            'Parts': parts
        },
        UploadId=upload_id,
    )
    return completeResult

Basic usage:

# https://blog.filestack.com/tutorials/amazon-s3-multipart-uploads-python-tutorial/
def multi_part_upload(file_path, key):
    parts = []
    i = 1  # part numbers start at 1

    upload_id = create_multipart_upload(key)

    with open(file_path, 'rb') as f:
        while True:
            piece = f.read(524288)  # 0.5 mb == 1024**2 / 2
            if piece == b'':
                break
            multi_upload_part(key, upload_id, parts, piece, i)
            i += 1

    print(complete_multi_part_upload(key, upload_id, parts))

Not mentioned, and something I haven't done yet, is cleaning up uploads that never complete, because I think I read in the docs that they're technically not deleted. If anyone wants to chime in, I'm all ears. From some Googling, you can abort a multipart upload.
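
For what it's worth, a rough sketch of that cleanup, assuming the same c3 client and bucket_name as in the code above: list_multipart_uploads shows uploads that were started but never completed, and abort_multipart_upload discards their parts so you stop paying for their storage.

def abort_stale_multipart_uploads():
    # List uploads that were started but never completed for this bucket.
    response = c3.list_multipart_uploads(Bucket=bucket_name)
    for upload in response.get('Uploads', []):
        # Abort each one; S3 then deletes the parts uploaded so far.
        c3.abort_multipart_upload(
            Bucket=bucket_name,
            Key=upload['Key'],
            UploadId=upload['UploadId'],
        )

S3 can also do this automatically if you add a bucket lifecycle rule that aborts incomplete multipart uploads after some number of days.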

Daniel Olson