
I just ran the Elastic MapReduce sample application "Apache Log Processing".

Default: When I ran it with the default configuration (2 small core instances), it took 19 minutes.

Scale out: Then I ran it with 8 small core instances; it took 18 minutes.

Scale up: Then I ran it with 2 large core instances; it took 14 minutes.
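
For reference, a quick sketch of the relative speedups from the numbers above (plain arithmetic, nothing EMR-specific):

```python
# Wall-clock times (minutes) for the three runs above.
runs = {
    "default (2 small core nodes)": 19,
    "scale out (8 small core nodes)": 18,
    "scale up (2 large core nodes)": 14,
}

baseline = runs["default (2 small core nodes)"]
for name, minutes in runs.items():
    print(f"{name}: {minutes} min, {baseline / minutes:.2f}x vs default")
```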

What do you think about the performance of scaling up vs. scaling out when we have bigger data sets?

Thanks.

paras_doshi

1 Answer


I would say it depends. I've usually found raw processing speed to be much better on m1.large and m1.xlarge instances. Other than that, as you've noticed, the same job will probably take the same amortized or normalized instance hours to complete.
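
To make "normalized instance hours" concrete, here's a rough sketch using the timings from your question. It assumes the usual EMR normalization weights (1 unit per small instance-hour, 4 per large) and that partial instance-hours are billed as full hours; treat both as assumptions rather than exact billing rules.

```python
import math

# Assumed normalization weights per instance-hour (small = 1, large = 4).
WEIGHTS = {"m1.small": 1, "m1.large": 4}

def normalized_instance_hours(instance_type, instance_count, minutes):
    # Partial instance-hours are rounded up to full hours (billing assumption).
    billed_hours = math.ceil(minutes / 60)
    return instance_count * billed_hours * WEIGHTS[instance_type]

for itype, count, minutes in [("m1.small", 2, 19),   # default
                              ("m1.small", 8, 18),   # scale out
                              ("m1.large", 2, 14)]:  # scale up
    hours = normalized_instance_hours(itype, count, minutes)
    print(f"{count} x {itype}, {minutes} min -> {hours} normalized instance hours")
```

On a job this short the hourly rounding dominates, but note that the scale-out and scale-up configurations still land on the same normalized total, which is the point above.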

For your own jobs, you might want to experiment with a smaller sample data set first, see how long that takes, and then extrapolate how long the full job on the large data set would take. I've found that to be the best way to estimate job completion time.
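
As a very rough sketch of that kind of extrapolation, assuming runtime grows roughly linearly with input size (a simplification; shuffle-heavy jobs can scale worse), something like:

```python
def estimate_full_runtime(sample_minutes, sample_gb, full_gb, overhead_minutes=5.0):
    """Naive linear extrapolation from a sample run to the full data set.

    overhead_minutes approximates fixed cluster/job startup cost that does not
    grow with data size; 5 minutes is just an assumed placeholder.
    """
    per_gb = (sample_minutes - overhead_minutes) / sample_gb
    return overhead_minutes + per_gb * full_gb

# Hypothetical example: a 10 GB sample took 12 minutes; estimate a 500 GB run.
print(f"{estimate_full_runtime(12, 10, 500):.0f} minutes")  # ~355 minutes
```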

Suman
  • Yes, thanks! I've edited my answer. I've also recently found that c1.xlarge might be good for CPU-intensive operations, and m2.2xlarge may be good for memory-intensive operations (like UniqValueCount operations). – Suman Sep 18 '12 at 17:02
  • Yeah, I've done similar experiments up to the "Cluster Compute Eight Extra Large Instance" for large-scale log processing. It seems that the larger the instance, the more cost-efficient it is, i.e. a unit of compute works out about 4 times cheaper than on a small instance :) – yura Sep 18 '12 at 18:40
  • I agree. A cc2.8xlarge is overkill for most of my jobs :) but I'm assuming that for really large jobs, they do offer bang-for-the-buck. – Suman Sep 18 '12 at 18:49