
My use case is as follows:

I have a Python script which:

1. reads a file from S3
2. processes the file and outputs a new file
3. saves the output file to S3 (or maybe a database)

The Python script has some dependencies which are managed via virtualenv.

What is the recommended/easiest way of running these scripts in parallel on AWS?

I see the following options:

  1. AWS Batch: Looks really complicated - I have to build my own Docker container, set up three different users, and it's not easy to debug.
  2. AWS Lambda: A bit easier to set up, but I still have to wrap my script up in a Lambda function, and debugging doesn't seem straightforward.
  3. Slurm on manually spun-up EC2 instances - From a user perspective this is ideal: all I would have to do is create a jobs.sbatch file that loads the virtualenv and runs the script. The main drawback is that I have to install and configure Slurm myself.

What is the recommended way of handling this workflow?

Henry Henrinson
  • Rather than "running scripts in parallel", have you considered triggering the script whenever a new file is uploaded to S3? This means the data will be processed as the data arrives, rather than in batches later. – John Rotenstein Jun 28 '19 at 01:16
  • How long does it take to process each file? How many files do you need to process over what time period, and how often? – John Rotenstein Jun 28 '19 at 01:17

2 Answers


I think you can use a publish/subscribe mechanism: put the object key to work on into an SQS queue, then have a group of EC2 instances or ECS tasks each subscribe to the queue and perform a single unit of work. The queue ensures that each file is processed by only one worker. You can also put the workers in an ECS auto scaling group and change the number of machines to tune performance against cost.

Felice Pollano
  • Solid advice from a 10 year old. More seriously, this means I have to provision the EC2 instances manually (which is not terrible) - can I auto-scale capacity? – Henry Henrinson Jun 27 '19 at 13:01
  • @HenryHenrinson I think you can use ECS autoscaling https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html – Felice Pollano Jun 27 '19 at 13:05
  • @HenryHenrinson you can auto-scale ECS or EC2 instances based on the number of messages waiting in the SQS queue. If you can run your code in AWS Lambda then even the auto-scaling would be handled for you; all you would need to do is associate the Lambda function with the SQS queue. – Mark B Jun 27 '19 at 15:03

Lambda will be suitable for you because you won't have to manage scaling or set up any of the surrounding infrastructure. As for debugging, you can do it locally with sls wsgi serve (from the Serverless Framework's serverless-wsgi plugin).

sc0rp1on