
I have a CPU-intensive data-processing application that I want to run across many (~100,000) input files. The application needs a large (~20GB) data file in order to run. What I would like to do is

  • create an EC2 machine image that has my application and associated data files installed
  • boot up a large number (e.g. 100) of instances of this image
  • split my input files up into 100 batches and send one batch to be processed on each instance (a rough sketch of the batching step follows)
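
For concreteness, here is a minimal sketch (in Python) of that batching step; the file names and batch count are illustrative, not taken from the question.

```python
# Minimal sketch: split ~100,000 input file names into 100 batches (round-robin).
# The names below are purely illustrative placeholders.
def split_into_batches(input_files, n_batches=100):
    batches = [[] for _ in range(n_batches)]
    for i, name in enumerate(input_files):
        batches[i % n_batches].append(name)
    return batches

files = ["input_%06d.dat" % i for i in range(100000)]   # placeholder names
batches = split_into_batches(files)
print(len(batches), len(batches[0]))                    # 100 batches of ~1,000 files
```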

I am having trouble figuring out the best way to ensure that each instance has access to the large data file. The data file is too big to fit on the root filesystem of an AMI. I could use Elastic Block Store (EBS), but a given EBS volume can only be attached to a single instance at a time, so I would need 100 clones.

Is there some way to create a custom image that has more space on the root filesystem so that I can include my large data file? Or is there a better way to tackle this problem?

mojones

3 Answers

4

Put your data file in S3.

Create a "master" EBS-based instance:

  1. Mount the instance storage during instance creation
  2. On boot, copy the large data file from S3 to the instance volume (see the sketch after this list)
  3. Process the data locally
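
A rough sketch of step 2, assuming the file lives in an S3 bucket and boto3 is available on the instance; the bucket name, key, and local path are placeholders:

```python
# Boot-time copy of the big data file from S3 to the local instance volume.
# Bucket, key, and destination path are assumptions for illustration.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
cfg = TransferConfig(max_concurrency=10)   # parallel multipart download helps with a ~20GB object

s3.download_file(
    Bucket="my-data-bucket",
    Key="reference/big-data-file.bin",
    Filename="/mnt/data/big-data-file.bin",
    Config=cfg,
)
```

Running something like this from the instance's boot/user-data script means every worker starts with its own local copy.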

Create an AMI of your instance and launch 100 instances from your AMI.
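
Launching the workers can be scripted as well; a hedged boto3 sketch, where the AMI ID, instance type, and boot script name are all placeholders:

```python
# Launch 100 workers from the baked AMI. All identifiers below are placeholders.
import boto3

ec2 = boto3.client("ec2")

with open("bootstrap.sh") as f:            # assumed boot script: copy data from S3, start processing
    user_data = f.read()

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # the AMI created from the master instance
    InstanceType="c4.2xlarge",             # placeholder CPU-heavy instance type
    MinCount=100,
    MaxCount=100,
    UserData=user_data,
)
print(len(resp["Instances"]), "instances launched")
```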

The benefit of this approach is that each instance has its own local copy of the data, and you won't spend money storing multiple copies of the data on EBS volumes. The drawback is the time to copy the data file from S3 at launch, but once copied, processing the file should be fast since it will be local.

Matt Houser
3

If the data is fairly unchanging, put it in an EBS volume and make a snapshot of it. When you start each new node, have it create a new volume based on the snapshot and mount it. Making the snapshot is a fairly slow process, but creating volumes based on the snapshot is surprisingly quick!
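
For illustration, a rough boto3 sketch of that per-node step; the snapshot ID, instance ID, availability zone, and device name are placeholders:

```python
# Create a fresh volume from the data snapshot and attach it to a worker node.
# All IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2")

vol = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",   # snapshot of the ~20GB data volume
    AvailabilityZone="us-east-1a",         # must match the worker's AZ
)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",      # the worker being bootstrapped
    Device="/dev/sdf",
)
# The worker then mounts /dev/sdf (often exposed as /dev/xvdf) from its boot script.
```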

If your data changes from time to time, keeping it in S3 is easier to keep up to date, and hundreds of nodes can pull the data at once without a noticeable drop in speed compared to a single node pulling it down. Overall this will be slower than the EBS method above, but it is simpler to implement and maintain.

  • Good answers all - marking this as best as it's what I decided to do for simplicity, but I will revisit this question when I have a bit more experience with EC2. – mojones Sep 21 '12 at 07:32
2

Options:

  1. Use an S3 bucket to store your input data. Mount it on multiple worker instances.

  2. Create a "master" instance that shares the input files from your EBS volume (e.g. via NFS) with your worker instances.

Skyhawk
  • Will I get good performance with ~100 instances all reading from the same S3 bucket at the same time? I had originally thought that it would be quicker for each instance to have a "local" copy of the necessary data. – mojones Sep 19 '12 at 14:52
  • 1
    Yes, S3 is designed to be used in this way. Amazon provides [recommendations for optimal performance](http://aws.amazon.com/articles/1904). – Skyhawk Sep 19 '12 at 15:16