
I have a Python web application written in bottlepy. Its only purpose is to allow people to upload large files that will then be processed (processing takes approximately 10-15 minutes).

The upload code is rather simple:

import os

from bottle import abort, request, route

@route('/upload', method='POST')
def upload_file():
  uploadfile = request.files.get('fileToUpload')
  if not uploadfile:
    abort(500, 'No file selected for upload')

  # Only the extension is needed for the check below
  _, ext = os.path.splitext(uploadfile.filename)

  if ext not in ['.zip', '.gz']:
    abort(500, 'File extension not allowed')

  try:
    # save() raises IOError if the target already exists (overwrite defaults to False)
    uploadfile.save('./files')

    process_file(uploadfile.filename)  # this function is not yet implemented

    return "uploaded file '%s' for processing" % uploadfile.filename
  except IOError:
    abort(409, "File already exists.")

I plan to deploy this application using uWSGI (however, if another technology is better suited to the purpose, it's not set in stone).

Because of this I have some questions regarding the use of uWSGI for such a purpose:

  1. If the file upload takes minutes, how will uWSGI be capable of handling other clients without blocking?
  2. Is there any way the processing can be offloaded using built-in functionality in uWSGI, so that the user gets a response after the upload and can query for processing status?

Thank you for any help.

agnsaft

1 Answer


If the file upload takes minutes, how will uWSGI be capable of handling other clients without blocking?

It will block. One solution is to put a web server such as NGINX in front of uWSGI to buffer the POST request. The upload is then tied to an NGINX handler until it completes, and only then is the request passed to the uWSGI handler.

Is there any way the processing can be offloaded using built-in functionality in uWSGI, so that the user gets a response after the upload and can query for processing status?

You need a task queue to offload the processing from the web handler. This is common practice, and there are many Python task queues to choose from. Whether uWSGI's built-in features are enough depends on the task you need to offload. You can use the built-in uWSGI spooler or uWSGI mules. Both are good alternatives to a typical task queue (such as the well-known Celery), but they have limitations, so try them out in your scenario.
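To make the "respond after upload, query status later" pattern concrete, here is a minimal sketch using only the Python standard library. It is a stand-in for illustration, not the uWSGI spooler, mules, or Celery. The names `submit`, `status`, `jobs` and `work_queue` are hypothetical, and `process_file` is a placeholder for the real processing step.

```python
import queue
import threading
import uuid

# Hypothetical in-process job registry (assumption for illustration only).
jobs = {}
jobs_lock = threading.Lock()
work_queue = queue.Queue()


def process_file(filename):
    """Placeholder for the long-running processing step."""
    return "processed %s" % filename


def worker():
    """Take job ids off the queue, run them, and record the outcome."""
    while True:
        job_id = work_queue.get()
        with jobs_lock:
            job = jobs[job_id]
            job["status"] = "running"
        try:
            result = process_file(job["filename"])
            with jobs_lock:
                job["status"] = "done"
                job["result"] = result
        except Exception as exc:
            with jobs_lock:
                job["status"] = "failed"
                job["error"] = str(exc)
        finally:
            work_queue.task_done()


def submit(filename):
    """Queue a file for processing and return a job id immediately."""
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = {"filename": filename, "status": "queued"}
    work_queue.put(job_id)
    return job_id


def status(job_id):
    """Return a copy of the job record, or None for an unknown id."""
    with jobs_lock:
        job = jobs.get(job_id)
        return dict(job) if job else None


# One background worker; it exits with the process because it is a daemon.
threading.Thread(target=worker, daemon=True).start()
```

In the upload handler you would call `submit(uploadfile.filename)` instead of `process_file(...)` and return the job id, then expose `status(job_id)` through a second route. Note the limits of this sketch: the registry lives in one process, so it is not shared between several uWSGI workers and is lost on restart, and uWSGI needs thread support enabled (for example with `--enable-threads`) for the background thread to run. Those gaps are what the spooler or a dedicated task queue address.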

Paolo Casciello
  • Is this an nginx option (and on by default) or a uWSGI option? I believe I found some references to this when using nginx as a proxy, but found nothing when using it for uwsgi_pass. – agnsaft Aug 15 '13 at 16:50
  • @invictus it is `enabled` by default. I don't think it is even possible to disable it. It's especially useful when proxying. And uWSGI is an upstream proxy for Nginx :) – Paolo Casciello Aug 15 '13 at 16:54
  • uWSGI has some interesting-sounding features, like the spooler, mules, the offloading subsystem, and the queue framework. Can none of these be used for offloading processing tasks? – agnsaft Aug 15 '13 at 21:09
  • Yes, they can, but you are talking about uploading, so generally you need some form of synchronous processing. By the way, as long as you have a buffering proxy in front (like nginx) there will be no problems in this area (uWSGI will be woken up only when the whole file has been uploaded) – roberto Aug 16 '13 at 04:19
  • @roberto and the file does not need to be copied over to uWSGI, only the open file handler is transferred? – agnsaft Aug 16 '13 at 06:38
  • @roberto By the way, uploading is only the first part of the puzzle, the processing is a heavy processing of the file by python code. – agnsaft Aug 16 '13 at 06:45
  • @invictus no, the whole file content is passed to the uWSGI handler. That's how WSGI works, and it is because the real power of uWSGI workers is that they can sit on different servers in a cluster. :) Whether to use mules or the uWSGI spooler really depends on the task itself. For a simple file analysis they are probably enough, but sometimes a dedicated task queue is better. There isn't a rule of thumb. – Paolo Casciello Aug 16 '13 at 12:08
  • @PaoloCasciello I am curious about Celery. I was hoping to reduce dependencies to a minimum thus only use the uWSGI features, but I will look into Celery. – agnsaft Aug 17 '13 at 10:35
  • @invictus yeah, that is the tradeoff: either use a full-featured distributed task queue like Celery, which adds another PoF to the architecture to monitor and manage, or use a simpler, less featured (but nevertheless powerful) component already present in the architecture. :) Remember to accept the answer if it was useful. ;) – Paolo Casciello Aug 17 '13 at 13:25