
UPDATE (5/18/2020): Solution at the end of this post!

I'm attempting to upload large CSV files (30MB - 2GB) from a browser to GCP App Engine running Python 3.7 + Flask, and then push those files to GCP Storage. This works fine in local testing with large files, but on GCP it errors out immediately with "413 - Your client issued a request that was too large" whenever the file is larger than roughly 20MB. The error happens instantly on upload, before the request even reaches my Python code (I suspect App Engine rejects it based on the Content-Length header). I tried many solutions after lots of SO/blog research, to no avail. Note that I am using the basic/free App Engine standard environment with an F1 instance running the Gunicorn server.

First, I tried setting app.config['MAX_CONTENT_LENGTH'] = 2147483648 but that didn't change anything (SO post). My app still threw an error before it even reached my Python code:

# main.py
    import flask
    from google.cloud import storage

    app = flask.Flask(__name__)
    app.config['MAX_CONTENT_LENGTH'] = 2147483648   # 2GB limit

    @app.route('/upload', methods=['POST', 'GET'])
    def upload():
        # COULDN'T GET THIS FAR WITH A LARGE UPLOAD!!!
        if flask.request.method == 'POST':

            uploaded_file = flask.request.files.get('file')

            storage_client = storage.Client()
            storage_bucket = storage_client.get_bucket('my_uploads')

            blob = storage_bucket.blob(uploaded_file.filename)
            blob.upload_from_string(uploaded_file.read())

        return 'OK'

<!-- index.html -->
    <form method="POST" action='/upload' enctype="multipart/form-data">
        <input type="file" name="file">
        <input type="submit" value="Upload">
    </form>

After further research, I switched to chunked uploads with Flask-Dropzone, hoping I could upload the data in batches and then append to / build up the CSV file as a Storage blob:

# main.py
    import flask
    from flask_dropzone import Dropzone
    from google.cloud import storage

    app = flask.Flask(__name__)
    app.config['MAX_CONTENT_LENGTH'] = 2147483648   # 2GB limit
    dropzone = Dropzone(app)


    @app.route('/upload', methods=['POST', 'GET'])
    def upload():

        if flask.request.method == 'POST':

            uploaded_file = flask.request.files.get('file')

            storage_client = storage.Client()
            storage_bucket = storage_client.get_bucket('my_uploads')

            CHUNK_SIZE = 10485760  # 10MB
            blob = storage_bucket.blob(uploaded_file.filename, chunk_size=CHUNK_SIZE)

            # hoping for a create-if-not-exists then append thereafter
            blob.upload_from_string(uploaded_file.read())

        return 'OK'

And the JS/HTML is straight from a few samples I found online:

    <script>
        Dropzone.options.myDropzone = {
            timeout: 300000,
            chunking: true,
            chunkSize: 10485760
        };
    </script>
    ....
    <form method="POST" action='/upload' class="dropzone dz-clickable" 
      id="dropper" enctype="multipart/form-data">
    </form>

The above does upload in chunks (I can see repeated calls to POST /upload), but the call to blob.upload_from_string(uploaded_file.read()) just keeps replacing the blob's contents with the most recent chunk instead of appending. Stripping out the chunk_size=CHUNK_SIZE parameter doesn't help either.
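As far as I can tell, Cloud Storage objects are immutable, so each chunk's upload simply replaces whatever the previous chunk wrote; a minimal illustration of that behavior (continuing from the handler above, with a placeholder object name):

    # Each chunked POST ends up doing the equivalent of this; Cloud Storage
    # objects are immutable, so the second upload replaces the first.
    blob = storage_bucket.blob('data.csv')         # placeholder object name
    blob.upload_from_string('chunk 1 contents\n')
    blob.upload_from_string('chunk 2 contents\n')  # object now holds only this chunk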

Next I looked at writing to /tmp and then uploading to Storage, but the docs say /tmp is backed by the little memory I have, and the rest of the filesystem is read-only, so neither option will work.

Is there an append API or an approved way to upload big files to GCP App Engine and push/stream them to Storage? Given that the code works on my local server (and happily uploads to GCP Storage), I'm assuming this is a built-in limitation of App Engine that needs to be worked around.


SOLUTION (5/18/2020): I was able to use Flask-Dropzone to have JavaScript split the upload into many 10MB chunks and send those chunks one at a time to the Python server. On the Python side, each chunk is appended to a file in /tmp to "build up" the contents until all chunks have arrived. On the last chunk, the assembled file is uploaded to GCP Storage and the /tmp file is deleted.

    import os

    import flask
    from google.cloud import storage

    app = flask.Flask(__name__)
    storage_client = storage.Client()


    @app.route('/upload', methods=['POST'])
    def upload():

        uploaded_file = flask.request.files.get('file')

        # append this chunk to a temp file until every chunk has arrived
        tmp_file_path = '/tmp/' + uploaded_file.filename
        with open(tmp_file_path, 'a') as f:
            f.write(uploaded_file.read().decode("UTF8"))

        chunk_index = int(flask.request.form.get('dzchunkindex', 0))
        chunk_count = int(flask.request.form.get('dztotalchunkcount', 1))

        # on the last chunk, push the assembled file to Storage and clean up
        if chunk_index == (chunk_count - 1):
            print('Saving file to storage')
            storage_bucket = storage_client.get_bucket('prairi_uploads')
            blob = storage_bucket.blob(uploaded_file.filename)  # CHUNK??

            blob.upload_from_filename(tmp_file_path, client=storage_client)
            print('Saved to Storage')

            print('Deleting temp file')
            os.remove(tmp_file_path)

        return 'OK'

<!-- index.html -->
    <script>
        Dropzone.options.myDropzone = {
            ... // configs
            timeout: 300000,
            chunking: true,
            chunkSize: 1000000
        };
    </script>

Note that /tmp shares resources with RAM, so you need at least as much RAM as the uploaded file size, plus more for Python itself (I had to use an F4 instance). I would imagine there's a better solution than /tmp, writing to some kind of block storage instead, but I haven't gotten that far yet.
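One untested idea for avoiding /tmp entirely: upload each chunk as its own Storage object (e.g. name.part0, name.part1, ...) and merge them server-side with the Cloud Storage compose API, which the Python client exposes as Blob.compose(). A rough sketch, with placeholder names; note that compose() accepts at most 32 source objects per call, so very large files would need to be composed iteratively:

    # Untested sketch: stitch per-chunk objects together with Blob.compose()
    # instead of buffering the whole file in /tmp. Names are placeholders;
    # compose() takes at most 32 source objects per call.
    from google.cloud import storage

    def compose_chunks(bucket_name, final_name, chunk_count):
        client = storage.Client()
        bucket = client.get_bucket(bucket_name)

        # assumes each chunk was uploaded as final_name.part0, .part1, ...
        parts = [bucket.blob(f'{final_name}.part{i}') for i in range(chunk_count)]

        destination = bucket.blob(final_name)
        destination.compose(parts)   # server-side concatenation, nothing re-downloaded

        for part in parts:           # remove the per-chunk objects
            part.delete()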

P_impl55

1 Answer


The answer is that you cannot upload or download files larger than 32 MB in a single HTTP request. Source

You either need to redesign your service to transfer data in multiple HTTP requests, transfer the data directly to Cloud Storage using Presigned URLs, or select a different service that does NOT use the Global Front End (GFE), such as Compute Engine. This excludes services such as Cloud Functions, Cloud Run, and App Engine Flexible.
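For the Presigned (signed) URL approach, here is a minimal sketch with the Python client; the bucket name, object name, and expiration are placeholders, and generating a V4 signature requires credentials that can sign (e.g. a service account key or the IAM signBlob permission):

    # Sketch: create a V4 signed URL so the browser can PUT the file straight
    # to Cloud Storage and bypass the 32 MB request limit entirely.
    import datetime

    from google.cloud import storage

    def make_upload_url(bucket_name, object_name):
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(object_name)
        return blob.generate_signed_url(
            version='v4',
            expiration=datetime.timedelta(minutes=15),
            method='PUT',
            content_type='text/csv',
        )

    # Browser side (roughly):
    #   fetch(url, {method: 'PUT', headers: {'Content-Type': 'text/csv'}, body: file})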

If you use multiple HTTP requests, you will need to manage memory as all temporary files are stored in memory. This means you will have issues as you approach the maximum instance size of 2 GB.

John Hanley
  • The Presigned URL feature looks interesting; I will give it a closer look! With regard to splitting the data into multiple HTTP requests, what if I reused the same batch/chunking mechanism browser-side and split a large file into N individual uploads to Storage, then merged them together using something like `gsutil compose`? For example, a 120MB file could be uploaded as 6 batches of 20MB each into Storage, like file_1, file_2, ..., file_6, and then merged by calling `gsutil compose` within the Python logic (assuming that's even possible)? – P_impl55 May 17 '20 at 22:16
  • @P_impl55 Yes, you can use multiple HTTP requests (not a single request that is chunked). App Engine is not an operating system, so you cannot use programs like gsutil. You can combine and upload to Cloud Storage in your code. – John Hanley May 17 '20 at 22:42
  • John - I appreciate your help! A follow-up: if I can't use `gsutil`, then what can I use to combine individually uploaded files from multiple HTTP requests? I don't see a GCP Storage API to merge, append, or work with byte buffers/streams like I would normally expect. I was hoping there'd be something like `blob.write(file, offset, length, bytes)` that would let me stitch files together, but that doesn't seem to exist from what I've seen. – P_impl55 May 17 '20 at 22:53
  • I am not aware of any libraries, but there should be some. I just use the REST API and read the files and write to Cloud Storage. The merge happens in my code. Think: create object, read the first file, write to object, read the second file, write to object, repeat, close object. Do not close/end the HTTP data stream when writing to Cloud Storage. Note the stream is an HTTP PUT request and you are continuously writing data as one request for each file (the fragments). – John Hanley May 17 '20 at 23:05 (see the sketch after these comments)
  • @P_impl55 - Create new questions as the original question has been answered. – John Hanley May 17 '20 at 23:06
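For what it's worth, a rough sketch of the read-then-write merge pattern described in the comments above, using the Python client's file-like blob writer (Blob.open, available in newer versions of google-cloud-storage); all names are placeholders:

    # Rough sketch: stream previously uploaded fragments into one destination
    # object. Requires a recent google-cloud-storage (for Blob.open); the
    # bucket/object names are placeholders.
    from google.cloud import storage

    def merge_fragments(bucket_name, fragment_names, final_name):
        client = storage.Client()
        bucket = client.get_bucket(bucket_name)

        destination = bucket.blob(final_name)
        with destination.open('wb') as out:               # one resumable upload session
            for name in fragment_names:
                bucket.blob(name).download_to_file(out)   # stream each fragment in

        for name in fragment_names:                       # optionally delete the fragments
            bucket.blob(name).delete()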