
My use case is as follows:

I have a Python script which:

1. reads a file from S3
2. processes the file and outputs a new file
3. saves the output file to S3 (or maybe a database)

The Python script has some dependencies which are managed via virtualenv.

What is the recommended/easiest way of running these scripts in parallel on AWS?

I see the following options:

  1. AWS Batch: Looks really complicated - I have to build my own Docker container, set up three different users, and it's not easy to debug.
  2. AWS Lambda: A bit easier to set up, but I still have to wrap my script up in a Lambda function, and debugging doesn't seem straightforward.
  3. Slurm on manually spun-up EC2 instances - From a user perspective this is ideal: all I would have to do is create a jobs.sbatch file that loads the virtualenv and runs the script. The main drawback is that I have to install and configure Slurm myself.

What is the recommended way of handling this workflow?

Henry Henrinson
  • Rather than "running scripts in parallel", have you considered triggering the script whenever a new file is uploaded to S3? This means the data will be processed as the data arrives, rather than in batches later. – John Rotenstein Jun 28 '19 at 01:16
  • How long does it take to process each file? How many files do you need to process over what time period, and how often? – John Rotenstein Jun 28 '19 at 01:17

2 Answers


I think you can use a publish/subscribe mechanism: put the object key to work on into an SQS queue, then have a group of EC2 instances or ECS tasks each subscribe to the queue and perform a single unit of work. The queue ensures that each file is processed by only one worker. You can also put the workers in an ECS auto scaling group and change the number of machines to tune performance against cost.

Felice Pollano
  • Solid advice from a 10 year old. More seriously, this means I have to provision the EC2 instances manually (which is not terrible) - can I auto-scale capacity? – Henry Henrinson Jun 27 '19 at 13:01
  • @HenryHenrinson I think you can use ECS autoscaling https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html – Felice Pollano Jun 27 '19 at 13:05
  • @HenryHenrinson you can auto-scale ECS or EC2 instances based on the number of messages waiting in the SQS queue. If you can run your code in AWS Lambda then even the auto-scaling would be handled for you; all you would need to do is associate the Lambda function with the SQS queue. – Mark B Jun 27 '19 at 15:03

Lambda will be suitable for you because you won't have to manage scaling or set up any of the surrounding infrastructure. As for debugging, you can do it locally with sls wsgi serve (from the Serverless Framework's serverless-wsgi plugin).

sc0rp1on