
I have a CPU-intensive data-processing application that I want to run across many (~100,000) input files. The application needs a large (~20GB) data file in order to run. What I would like to do is

  • create an EC2 machine image that has my application and associated data files installed
  • boot up a large number (e.g. 100) of instances of this image
  • split my input files up into 100 batches and send one batch to be processed on each instance (a rough sketch of the batching step follows)
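
For concreteness, here is a minimal sketch (in Python) of that batching step; the file names and batch count are illustrative, not taken from the question.

```python
# Minimal sketch: split ~100,000 input file names into 100 batches (round-robin).
# The names below are purely illustrative placeholders.
def split_into_batches(input_files, n_batches=100):
    batches = [[] for _ in range(n_batches)]
    for i, name in enumerate(input_files):
        batches[i % n_batches].append(name)
    return batches

files = ["input_%06d.dat" % i for i in range(100000)]   # placeholder names
batches = split_into_batches(files)
print(len(batches), len(batches[0]))                    # 100 batches of ~1,000 files
```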

I am having trouble figuring out the best way to ensure that each instance has access to the large data file. The data file is too big to fit on the root filesystem of an AMI. I could use Elastic Block Store (EBS), but a given EBS volume can only be attached to a single instance at a time, so I would need 100 clones.

Is there some way to create a custom image that has more space on the root filesystem so that I can include my large data file? Or is there a better way to tackle this problem?

mojones

3 Answers

4

Put your data file in S3.

Create a "master" EBS-based instance:

  1. Mount the instance storage during instance creation
  2. On boot, copy the large data file from S3 to the instance volume (see the sketch after this list)
  3. Process the data locally
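
A rough sketch of step 2, assuming the file lives in an S3 bucket and boto3 is available on the instance; the bucket name, key, and local path are placeholders:

```python
# Boot-time copy of the big data file from S3 to the local instance volume.
# Bucket, key, and destination path are assumptions for illustration.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
cfg = TransferConfig(max_concurrency=10)   # parallel multipart download helps with a ~20GB object

s3.download_file(
    Bucket="my-data-bucket",
    Key="reference/big-data-file.bin",
    Filename="/mnt/data/big-data-file.bin",
    Config=cfg,
)
```

Running something like this from the instance's boot/user-data script means every worker starts with its own local copy.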

Create an AMI of your instance and launch 100 instances from your AMI.
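
Launching the workers can be scripted as well; a hedged boto3 sketch, where the AMI ID, instance type, and boot script name are all placeholders:

```python
# Launch 100 workers from the baked AMI. All identifiers below are placeholders.
import boto3

ec2 = boto3.client("ec2")

with open("bootstrap.sh") as f:            # assumed boot script: copy data from S3, start processing
    user_data = f.read()

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # the AMI created from the master instance
    InstanceType="c4.2xlarge",             # placeholder CPU-heavy instance type
    MinCount=100,
    MaxCount=100,
    UserData=user_data,
)
print(len(resp["Instances"]), "instances launched")
```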

The benefit of this approach is that each instance has its own local copy of the data, and you won't spend money storing multiple copies of the data on EBS volumes. The drawback is the time to copy the data file from S3 at launch, but once copied, processing the file should be fast since it will be local.

Matt Houser
3

If the data is fairly unchanging, put it in an EBS volume and make a snapshot of it. When you start each new node, have it create a new volume based on the snapshot and mount it. Making the snapshot is a fairly slow process, but creating volumes based on the snapshot is surprisingly quick!
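
For illustration, a rough boto3 sketch of that per-node step; the snapshot ID, instance ID, availability zone, and device name are placeholders:

```python
# Create a fresh volume from the data snapshot and attach it to a worker node.
# All IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2")

vol = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",   # snapshot of the ~20GB data volume
    AvailabilityZone="us-east-1a",         # must match the worker's AZ
)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",      # the worker being bootstrapped
    Device="/dev/sdf",
)
# The worker then mounts /dev/sdf (often exposed as /dev/xvdf) from its boot script.
```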

If your data changes from time to time, keeping it in S3 is easier to keep up to date, and hundreds of nodes can pull the data at once without a noticeable drop in speed compared to a single node pulling it down. Overall this will be slower than the EBS method above, but it is simpler to implement and maintain.

  • Good answers all - marking this as best as it's what I decided to do for simplicity, but I will revisit this question when I have a bit more experience with EC2. – mojones Sep 21 '12 at 07:32
2

Options:

  1. Use an S3 bucket to store your input data. Mount it on multiple worker instances.

  2. Create a "master" instance that shares the input files from your EBS volume (e.g. via NFS) with your worker instances.

Skyhawk
  • Will I get good performance with ~100 instances all reading from the same S3 bucket at the same time? I had originally thought that it would be quicker for each instance to have a "local" copy of the necessary data. – mojones Sep 19 '12 at 14:52
  • 1
    Yes, S3 is designed to be used in this way. Amazon provides [recommendations for optimal performance](http://aws.amazon.com/articles/1904). – Skyhawk Sep 19 '12 at 15:16