
I am trying to build a multi-node parallel job in AWS Batch running an R script. My R script independently runs multiple statistical models for multiple users. I therefore want to split this job and distribute it across a cluster of several servers for faster execution. My understanding is that at some point I have to prepare a containerized version of my R application code using a Dockerfile pushed to ECR. My question is:

  1. Should the parallel logic be placed inside the R code, using a single Dockerfile? If so, how does Batch know how to split my job (into how many chunks)? Is a for-loop in the R code enough?
  2. Or should I define the parallel logic somewhere in the Dockerfile, saying something like: container 1 runs the models for users 1-5, container 2 runs the models for users 6-10, etc.?

Could you please share some ideas or code on this topic for better understanding? Much appreciated.

Telis

1 Answer


AWS Batch does not inspect or change anything in your container; it just runs it. So you need to handle the distribution of the work within the container itself.

Since these are independent processes (they don't communicate with each other over MPI, etc.), you can leverage AWS Batch array jobs. Batch multi-node parallel (MNP) jobs are for tightly coupled workloads that need inter-instance or inter-GPU communication, for example over Elastic Fabric Adapter.
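As a rough sketch, submitting an array job from the AWS CLI looks like the following. The job name, queue, and job definition are placeholders you would replace with your own; `size` controls how many child jobs (chunks) Batch launches:

```shell
# Submit an array job of 4 child jobs; each child runs the same
# container image but receives a different AWS_BATCH_JOB_ARRAY_INDEX.
aws batch submit-job \
  --job-name r-models-array \
  --job-queue my-job-queue \
  --job-definition my-r-job-definition \
  --array-properties size=4
```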

Your application code in the container can read the AWS_BATCH_JOB_ARRAY_INDEX environment variable to decide which subset of users to process.

The AWS Batch documentation on array jobs includes an example of how to use the index.

Note that AWS_BATCH_JOB_ARRAY_INDEX is zero based, so you will need to account for that if your user numbering / naming scheme is different.
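In R, that mapping from the array index to a user subset could look something like this. The user list, chunk size, and `fit_models` function are hypothetical stand-ins for your own data and modeling code:

```r
# Sketch: pick this child job's user subset from the Batch array index.
users <- paste0("user", 1:20)   # all users across the whole job (placeholder)
chunk_size <- 5                 # users handled per array child job (placeholder)

# AWS_BATCH_JOB_ARRAY_INDEX is set by Batch on each child job; it is 0-based,
# so shift it to R's 1-based indexing when slicing the user vector.
idx <- as.integer(Sys.getenv("AWS_BATCH_JOB_ARRAY_INDEX", unset = "0"))

start <- idx * chunk_size + 1
end <- min(start + chunk_size - 1, length(users))
my_users <- users[start:end]

for (u in my_users) {
  # fit_models(u)  # hypothetical: run this user's statistical models
  message("Fitting models for ", u)
}
```

With `--array-properties size=4` and 20 users, child 0 would handle users 1-5, child 1 users 6-10, and so on.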

Angel Pizarro