The right way to architect a cluster in EC2

Question

I'm working on an open-source tool that has to run on a cluster in EC2, organized in a "one master, several slaves" manner. I need some advice on how to organize things correctly, in the simplest yet most reliable way.

What I basically need is code that runs on the master instance (which the user starts manually) and does the following:

a) Start N slave instances (N comes from the user).

b) After each instance is up and running, connect to it over SSH and start something.

c) Keep track of whether the slave instances are alive (e.g. by simply pinging them).

d) If a slave instance fails, make sure it is terminated, start another one, and repeat step b).

e) On a signal from the user, shut down the slave instances.
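
To make steps a) to e) concrete, here is a minimal sketch of one pass of such a supervision loop. It is only an illustration: reconcile, shutdown and the launch, bootstrap, is_alive and terminate callables are hypothetical names, and the real EC2 and SSH calls would have to be plugged in behind them.

```python
from typing import Callable, List


def reconcile(
    slaves: List[str],
    desired: int,
    is_alive: Callable[[str], bool],
    launch: Callable[[], str],
    terminate: Callable[[str], None],
    bootstrap: Callable[[str], None],
) -> List[str]:
    """Run one pass of the supervision loop and return the new slave ids."""
    healthy = []
    for instance_id in slaves:
        if is_alive(instance_id):          # step c: liveness check
            healthy.append(instance_id)
        else:
            terminate(instance_id)         # step d: make sure the failed one is gone
    while len(healthy) < desired:
        new_id = launch()                  # steps a and d: start a (replacement) slave
        bootstrap(new_id)                  # step b: connect and start the worker
        healthy.append(new_id)
    return healthy


def shutdown(slaves: List[str], terminate: Callable[[str], None]) -> List[str]:
    """Step e: terminate every slave and return the (now empty) slave list."""
    for instance_id in slaves:
        terminate(instance_id)
    return []
```

The master would call reconcile periodically (for example once a minute) with the current list of slave ids, and call shutdown when the user asks to stop the cluster.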

All this looks pretty simple and straightforward, yet I have some questions:

1) Ready-made solutions. I first took a look at ZooKeeper, but I was put off by its complexity. It seems like overkill for the simple thing I need. Another thing I found is StarCluster; it is also written in Python, which is nice (my tool is in Python too), but I'm not sure it does what I need (keeping track of instances and relaunching them). My question is: are there simple tools, libraries, or frameworks that I'm not aware of?

2) Another way to go would be to implement things myself. The question here is: are there any pitfalls in my problem that I'm not aware of? It all looks simple (several API calls plus a regular ping), but maybe I'm missing something and it would be better to use an existing tool?

3) If I code it all myself, the question is whether to use CloudWatch or not. Does it really make any difference for managing internal computation clusters, or is it only useful for things like high-load sites?

4) My simple architecture has no protection against master node failure. The user starts the master, then connects to it via a web interface and starts the cluster, but if the master node fails, everything breaks. The slaves can check whether the master node still exists and terminate themselves if it has failed. This gives some protection against a headless, money-consuming cluster left running, but it doesn't solve the problem of graceful restart. How can this be solved?
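
The self-termination check I have in mind for the slaves could look roughly like this. It is only a sketch: MasterWatchdog, check_master and max_misses are hypothetical names, and how the master is actually probed (ping, HTTP request, etc.) is left to the caller.

```python
from typing import Callable


class MasterWatchdog:
    """Counts consecutive failed master checks on a slave."""

    def __init__(self, check_master: Callable[[], bool], max_misses: int = 3) -> None:
        self.check_master = check_master   # returns True if the master answered
        self.max_misses = max_misses       # tolerated consecutive failures
        self.misses = 0

    def tick(self) -> bool:
        """Run one check; return True when the slave should terminate itself."""
        try:
            ok = self.check_master()
        except Exception:
            ok = False                     # treat any probe error as a miss
        if ok:
            self.misses = 0                # a single success resets the counter
        else:
            self.misses += 1
        return self.misses >= self.max_misses
```

Requiring several consecutive misses avoids killing a slave because of a single dropped packet, but it does nothing about restarting the master itself.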

5) Are there any other things to know, or important materials I should read, before starting to code this project?

Thank you in advance!

score 2 · Answer 1 · answered Oct 08 '12 at 15:43

You might want to take a look at Amazon's Auto Scaling. Obviously it only handles EC2 instances, but it takes care of a lot of the complexity of starting, stopping, and monitoring instances for you.

With Auto Scaling you create one or more groups. You tell Amazon how to create more instances in your group (AMI, userData, instance type, etc.) and how many instances you want in your group. Amazon will start as many instances as required and replace them should they fail.

You can use the API to change the number of required nodes (you can set it to 0 if you don't need any instances at that time), or you can base it on CloudWatch metrics. For example, if you used SQS to distribute jobs to your slaves, you could configure Auto Scaling to increase the group size from 0 to the desired number when jobs are available and to return the group size to 0 once the queue becomes empty.
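
As a rough illustration of the API route, the sketch below sets a group's size from the queue depth. I'm assuming a boto3 Auto Scaling client (boto3.client("autoscaling")) is passed in; desired_capacity, scale_to_queue and the group name are hypothetical, and reading the queue depth from SQS is left out.

```python
def desired_capacity(queue_depth: int, slave_count: int) -> int:
    """Return slave_count while jobs are waiting, otherwise 0."""
    return slave_count if queue_depth > 0 else 0


def scale_to_queue(autoscaling_client, group_name: str,
                   queue_depth: int, slave_count: int) -> int:
    """Set the group's desired capacity from the current queue depth.

    autoscaling_client is assumed to be boto3.client("autoscaling").
    The capacity must lie within the group's MinSize and MaxSize.
    """
    capacity = desired_capacity(queue_depth, slave_count)
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=capacity,
    )
    return capacity
```

The same effect can be had without any code of your own by attaching scaling policies to CloudWatch alarms on the queue, which is the second option mentioned above.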

You can also have multiple groups. For example, you might have a group for the master node that always has 1 instance (and EC2 will replace it should it fail) and a second group for slaves that has 0 instances when there is no work to be done and N instances when work is available.
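
A sketch of that two-group layout, again assuming a boto3 Auto Scaling client is passed in. The function name, the group names "cluster-master" and "cluster-slaves", and the launch configuration names are all hypothetical, and the launch configurations are assumed to exist already.

```python
from typing import List, Tuple


def create_cluster_groups(autoscaling_client, master_launch_config: str,
                          slave_launch_config: str, zones: List[str],
                          max_slaves: int) -> Tuple[str, str]:
    """Create a fixed-size master group and an initially empty slave group.

    autoscaling_client is assumed to be boto3.client("autoscaling").
    """
    # Master group: always exactly one instance, replaced if it fails.
    autoscaling_client.create_auto_scaling_group(
        AutoScalingGroupName="cluster-master",
        LaunchConfigurationName=master_launch_config,
        MinSize=1,
        MaxSize=1,
        DesiredCapacity=1,
        AvailabilityZones=zones,
    )
    # Slave group: starts at 0 and can be raised up to max_slaves later.
    autoscaling_client.create_auto_scaling_group(
        AutoScalingGroupName="cluster-slaves",
        LaunchConfigurationName=slave_launch_config,
        MinSize=0,
        MaxSize=max_slaves,
        DesiredCapacity=0,
        AvailabilityZones=zones,
    )
    return "cluster-master", "cluster-slaves"
```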

I've not used the EC2 APIs from Python myself, but I hear that boto does a good job of handling this for you.

score 1 · Answer 2 · edited May 23 '17 at 12:03

One possible approach is to use a PaaS (Platform as a Service) to handle a lot of the plumbing you need. A PaaS will typically handle at least:

Provisioning VMs
Deploying application code to new VMs
Monitoring VM status and starting new VMs to replace failed ones.
Auto-scaling

You would need to define your application according to the format expected by the PaaS and submit it. The rest should be automated. There is a good comparison of PaaS options here: Looking for PaaS providers recommendations

Disclaimer: I work for GigaSpaces, developer of the open-source PaaS stack Cloudify.
