
I have a data science application that I need to run once every 2-3 hours, where I need to use 64 cores for 6 minutes in an embarrassingly parallel fashion. Each of the cores needs to load 3GB of data from disk for a total of 192GB of disk data.

To achieve this in a cost-effective way, my plan is to spin up a 64-core EC2 spot instance using a script whenever I need to run one of these jobs. I also plan to have a 200GB AMI with my required data. Then, when the EC2 instance starts, I can run my 64 jobs and they can each load their 3GB of data off the SSD there.

Will this work, and how long will it take to spin up the EC2 spot instance with the large AMI? If it takes multiple minutes to start the instance, that's not good, since these are only 6-minute jobs that I want to run quickly. Is there a better way to achieve my workflow?

Jase
  • Have you benchmarked this yourself yet? That would be the only way to know for sure. Depending on EC2 load and OS init, it could take 1 to 5 minutes. – Rodrigo Murillo Feb 21 '20 at 09:21
  • @RodrigoM How would someone in my position use the cloud to run jobs then? If it takes 1-5 minutes to start an instance and 6 minutes to run, I've increased my running time by 50%. – Jase Feb 21 '20 at 09:23
  • Then you have an 11-minute job at maximum. – Rodrigo Murillo Feb 21 '20 at 09:28
  • I don't know what position you are in, but you should build boot time into your usage forecast. Benchmark, then optimize. With that much CPU it should be fast. – Rodrigo Murillo Feb 21 '20 at 09:28
  • @RodrigoM Does it cost extra money to spin up an EC2 instance with an extremely large AMI? – Jase Feb 21 '20 at 09:51
  • Just the instance cost, which for Linux is billed per second. You pay for the AMI's size as EBS storage, even when it's not launched. What is your instance type? – Rodrigo Murillo Feb 21 '20 at 14:00
  • An m5.16xlarge with 64 vCPUs is $3.072/hour on demand, or about $0.50 per 10-minute job. The spot price is currently $0.6385/hour, or about $0.10 per 10-minute job. So even with the maximum boot time, running every 2 hours will cost roughly $1 per day. – Rodrigo Murillo Feb 21 '20 at 14:12
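The arithmetic behind those figures can be sketched as below. This is only a rough estimate under stated assumptions: it uses the two hourly prices quoted in the comment (which change over time), assumes a 10-minute billed window per job, and covers instance time only, not storage. The function names are made up for illustration.

```python
# Hourly prices quoted in the comment above (m5.16xlarge, Feb 2020).
# Spot prices fluctuate, so treat these as illustrative inputs.
ON_DEMAND_PER_HOUR = 3.072
SPOT_PER_HOUR = 0.6385


def cost_per_job(price_per_hour: float, minutes: float) -> float:
    """Instance cost of one run; per-second billing makes this linear in time."""
    return price_per_hour * minutes / 60.0


def cost_per_day(price_per_hour: float, minutes: float, hours_between_jobs: float) -> float:
    """Instance cost of a day of runs started every `hours_between_jobs` hours."""
    jobs_per_day = 24.0 / hours_between_jobs
    return cost_per_job(price_per_hour, minutes) * jobs_per_day


on_demand_job = cost_per_job(ON_DEMAND_PER_HOUR, 10)  # 0.512
spot_job = cost_per_job(SPOT_PER_HOUR, 10)            # roughly 0.106
spot_day = cost_per_day(SPOT_PER_HOUR, 10, 2)         # roughly 1.28

print(f"on demand per job: ${on_demand_job:.2f}")
print(f"spot per job:      ${spot_job:.2f}")
print(f"spot per day:      ${spot_day:.2f}")
```

At a 3-hour interval the daily figure drops by a third, since only 8 jobs run per day instead of 12.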

1 Answer


I ran a quick test on an m5n.16xlarge instance with 64 CPUs and no additional storage. The instance booted almost immediately, within the first 10 seconds. The additional storage of a large EBS-backed AMI should have no impact on boot time.

I noted that these instance types feature high-performance local storage of 600GB included with the instance:

Local NVMe-based SSD block level storage physically connected to the host server is available on all M5d, M5dn, and M5ad instances. These instances are a great fit for applications that need access to high-speed, low latency local storage including those that need temporary storage of data for scratch space, temporary files, and caches.

You may consider moving your data to S3 and copying it down to local storage for processing. This would make the AMI independent of your data, should you need to change the data more frequently. These large instances have at least 10 Gbps of network bandwidth, so the data transfer should be fast. You would incur S3 costs on each boot, however (per-request charges; data transfer from S3 to EC2 within the same region is not billed).
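As a minimal sketch of the copy-down step, assuming the data is stored in S3 as one object per job: the bucket name, key prefix, `shard-NN.bin` naming scheme, and mount point below are all hypothetical. The only boto3 call relied on is the S3 client's `download_file(Bucket, Key, Filename)`; downloads run in a thread pool so several shards transfer at once.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def shard_key(prefix: str, index: int) -> str:
    """Hypothetical naming scheme: one object per job, e.g. data/shard-07.bin."""
    return f"{prefix}/shard-{index:02d}.bin"


def download_shards(s3_client, bucket, prefix, dest_dir, n_shards=64, max_workers=16):
    """Download every shard to dest_dir in parallel and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)

    def fetch(index):
        key = shard_key(prefix, index)
        local_path = os.path.join(dest_dir, os.path.basename(key))
        # Standard boto3 S3 client call: download_file(Bucket, Key, Filename).
        s3_client.download_file(bucket, key, local_path)
        return local_path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so paths[i] belongs to shard i.
        return list(pool.map(fetch, range(n_shards)))


# Usage on the instance (assumes boto3 is installed and the instance role
# can read the bucket; bucket, prefix and mount point are placeholders):
#   import boto3
#   paths = download_shards(boto3.client("s3"), "my-bucket", "data", "/mnt/nvme/data")
```

Each of the 64 jobs can then read its own file from local storage. Alternatively, each job could download just its own shard right before processing it, which overlaps transfer with compute.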

In the AMI launch configuration, make sure to use "General Purpose SSD" (gp2) EBS volumes for the instance root volume; these are faster than the older magnetic HDD volume types.
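For a scripted launch, the request parameters might look like the sketch below. It only builds the keyword arguments for boto3's EC2 `run_instances` call, asking for a spot instance with a gp2 root volume. The helper name, the AMI ID in the usage line, the root device name (`/dev/xvda` is typical for Amazon Linux; other AMIs often use `/dev/sda1`), and the 200GB size are assumptions to adapt.

```python
def spot_launch_params(ami_id, instance_type="m5.16xlarge",
                       root_device="/dev/xvda", root_size_gb=200):
    """Build keyword arguments for the boto3 EC2 client's run_instances call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Request spot capacity instead of on-demand.
        "InstanceMarketOptions": {"MarketType": "spot"},
        "BlockDeviceMappings": [
            {
                # Assumption: must match the root device name of your AMI.
                "DeviceName": root_device,
                "Ebs": {
                    "VolumeType": "gp2",  # General Purpose SSD
                    "VolumeSize": root_size_gb,
                    "DeleteOnTermination": True,
                },
            }
        ],
    }


# Usage (assumes boto3 and credentials; the AMI ID is a placeholder):
#   import boto3
#   boto3.client("ec2").run_instances(**spot_launch_params("ami-0123456789abcdef0"))
```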

https://aws.amazon.com/ec2/instance-types/m5/

Rodrigo Murillo
  • Thank you for doing the test. I also ran the test and noticed it booted immediately. I was using the gp SSD. However, data IO was extremely slow, I presume because the AMI was gradually being transferred to the local SSD. I am therefore interested in your suggestion of "moving your data to S3 and copying it...". How would I achieve this? – Jase Feb 22 '20 at 02:19