I am very new to the parallel computing world. My group uses Amazon EC2 and S3 to manage all of our data, and it has really opened a new world to me.
My question is how to estimate the cost of a computation. Suppose I have n TB of data split across k files on Amazon S3 (for example, 0.5 TB across 7,000 zip files). I would like to loop through all of the files and apply a single regex-matching operation, written in Pig Latin, to each line of each file.
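For concreteness, here is a minimal sketch of the kind of Pig Latin script I have in mind. It assumes the files have been unpacked to plain text; the bucket paths and the pattern are just placeholders:

```
-- Load every line of the input files as a single text field.
-- (s3://my-bucket/input/ is a placeholder path.)
lines   = LOAD 's3://my-bucket/input/*' USING TextLoader() AS (line:chararray);

-- Keep only the lines that match the pattern.
-- MATCHES must match the whole line, hence the leading/trailing .*
matched = FILTER lines BY line MATCHES '.*some-pattern.*';

STORE matched INTO 's3://my-bucket/output/';
```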
Specifically, I am interested in estimating the following:
- How many instances should I select to perform this task, and what capacity should each have (the size of the master instance and of the map-reduce instances)? Can I derive these capacities and the resulting cost from n, k, and the cost of each operation?
- I have designed an example data flow: one xlarge instance as the master node and 10 medium instances as the map-reduce group. Would this be enough?
- How can I maximize the bandwidth with which each of these instances fetches data from S3? In my example data flow, the read speed from S3 appears to be about 250,000,000 bytes per minute. How much data is actually transferred to each EC2 instance, and would this be the bottleneck of my job flow? (My own back-of-envelope attempt is below.)
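Here is my rough attempt at the last question, assuming the 250,000,000 bytes/minute figure is per instance and using decimal units (please correct the arithmetic or the assumption if either is wrong): 250,000,000 bytes/minute is roughly 4.2 MB/s per instance, or about 2,500 MB/minute (~42 MB/s) aggregate across the 10 medium instances. At that rate, reading 0.5 TB (~500,000 MB) from S3 would take about 500,000 / 2,500 = 200 minutes, i.e. a bit over 3 hours, before any regex work is counted. Is this the right way to reason about whether the S3 reads are the bottleneck, and about how the answer scales with n and k?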