I am very new to the parallel computing world. My group uses Amazon EC2 and S3 to manage all of our data, and it has really opened a new world to me.
My question is how to estimate the cost of a computation. Suppose I have n TB of data split across k files on Amazon S3 (for example, 0.5 TB across 7,000 zip files). I would like to loop through all of the files and apply a single regex-matching operation, written in Pig Latin, to each line of each file.
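For concreteness, here is a minimal sketch of the kind of Pig Latin script I have in mind. It assumes the files have been unpacked to plain text; the bucket paths and the pattern are just placeholders:

```
-- Load every line of the input files as a single text field.
-- (s3://my-bucket/input/ is a placeholder path.)
lines   = LOAD 's3://my-bucket/input/*' USING TextLoader() AS (line:chararray);

-- Keep only the lines that match the pattern.
-- MATCHES must match the whole line, hence the leading/trailing .*
matched = FILTER lines BY line MATCHES '.*some-pattern.*';

STORE matched INTO 's3://my-bucket/output/';
```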
Specifically, I am interested in estimating the following:
- How many instances should I select to perform this task, and what capacity should each have (the size of the master instance and of the map-reduce instances)? Can I derive these capacities and the resulting cost from n, k, and the cost of each operation?
- I have designed an example data flow: one xlarge instance as the master node and 10 medium instances as the map-reduce group. Would this be enough?
- How can I maximize the bandwidth with which each of these instances fetches data from S3? In my example data flow, the read speed from S3 appears to be about 250,000,000 bytes per minute. How much data is actually transferred to each EC2 instance, and would this be the bottleneck of my job flow? (My own back-of-envelope attempt is below.)
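Here is my rough attempt at the last question, assuming the 250,000,000 bytes/minute figure is per instance and using decimal units (please correct the arithmetic or the assumption if either is wrong): 250,000,000 bytes/minute is roughly 4.2 MB/s per instance, or about 2,500 MB/minute (~42 MB/s) aggregate across the 10 medium instances. At that rate, reading 0.5 TB (~500,000 MB) from S3 would take about 500,000 / 2,500 = 200 minutes, i.e. a bit over 3 hours, before any regex work is counted. Is this the right way to reason about whether the S3 reads are the bottleneck, and about how the answer scales with n and k?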