I'm working on an open-source tool that will have to run on a cluster in EC2, organized in a "one master, several slaves" manner. I need some advice on how to organize things correctly, in the simplest yet most reliable way.
What I basically need is code that runs on the master instance (which the user starts manually) and does the following:
a) Launch N slave instances (N is supplied by the user).
b) After each instance is up and running - connect to it over SSH and start something.
c) Keep track of whether the slave instances are alive (e.g. by simply pinging them).
d) If a slave instance fails - make sure it is terminated, launch another one and repeat step b).
e) On a signal from the user - shut down the slave instances.
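To make steps a) to e) concrete, here is a minimal sketch of the control loop I have in mind. It is only an illustration: `Backend` is a hypothetical interface that would wrap the EC2 API calls and the SSH step, and `FakeBackend` is an in-memory stand-in so the loop can be run without AWS. None of these names come from an existing library.

```python
from typing import Dict, List, Protocol, Set


class Backend(Protocol):
    """Hypothetical wrapper around the cloud API and SSH (an assumption, not a real library)."""
    def launch(self) -> str: ...
    def is_ready(self, instance_id: str) -> bool: ...
    def start_job(self, instance_id: str) -> None: ...
    def is_alive(self, instance_id: str) -> bool: ...
    def terminate(self, instance_id: str) -> None: ...


class Supervisor:
    def __init__(self, backend: Backend, n: int) -> None:
        self.backend = backend
        self.n = n
        self.slaves: Dict[str, str] = {}  # instance id -> "pending" or "running"

    def _launch_one(self) -> None:
        self.slaves[self.backend.launch()] = "pending"

    def start(self) -> None:
        # Step a): launch N slave instances.
        for _ in range(self.n):
            self._launch_one()

    def tick(self) -> None:
        # Iterate over a snapshot, because the dict is modified inside the loop.
        for iid, state in list(self.slaves.items()):
            if state == "pending":
                # Step b): once the instance is up, start the job on it.
                if self.backend.is_ready(iid):
                    self.backend.start_job(iid)
                    self.slaves[iid] = "running"
            elif not self.backend.is_alive(iid):
                # Steps c) and d): a running slave stopped responding,
                # so make sure it is terminated and launch a replacement.
                self.backend.terminate(iid)
                del self.slaves[iid]
                self._launch_one()

    def shutdown(self) -> None:
        # Step e): terminate everything on the user's request.
        for iid in list(self.slaves):
            self.backend.terminate(iid)
        self.slaves.clear()


class FakeBackend:
    """In-memory stand-in used only to exercise the loop."""
    def __init__(self) -> None:
        self.counter = 0
        self.alive: Set[str] = set()
        self.jobs: List[str] = []
        self.terminated: List[str] = []

    def launch(self) -> str:
        self.counter += 1
        iid = f"i-{self.counter}"
        self.alive.add(iid)
        return iid

    def is_ready(self, instance_id: str) -> bool:
        return instance_id in self.alive

    def start_job(self, instance_id: str) -> None:
        self.jobs.append(instance_id)

    def is_alive(self, instance_id: str) -> bool:
        return instance_id in self.alive

    def terminate(self, instance_id: str) -> None:
        self.alive.discard(instance_id)
        self.terminated.append(instance_id)
```

The master would call `start()` once and then `tick()` at a regular interval until the user asks for `shutdown()`. The sketch has no timeout for instances that never become ready, which a real version would presumably need.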
All this looks pretty simple and straightforward, yet I have some questions:
1) Ready-made solutions. First I took a look at ZooKeeper, but I was put off by its complexity. It seems to be overkill for the simple thing that I need. Another thing I found is StarCluster; it is also written in Python, which is nice (my tool is in Python too), but I'm not sure it does what I need (keeping track of instances and relaunching failed ones). My question is: are there simple tools, libraries, or frameworks that I'm not aware of?
2) Another way to go would be to implement things myself. The question here is: are there any pitfalls in my problem that I'm not aware of? It all looks simple: several API calls plus a regular ping. But maybe I'm missing something, and it would really be better to use an existing tool?
3) If I code it all myself, the question is whether to use CloudWatch or not. Does it really make any difference for managing internal computation clusters, or is it mainly useful for high-load sites, etc.?
4) My simple architecture does not have any protection against master node failure. The user starts the master, then connects to it via a web interface and starts the cluster, but if the master node fails, everything breaks. The slaves can check that the master node is still there and terminate themselves if it fails. This adds some protection against ending up with a headless, money-consuming cluster, but it doesn't solve the problem of graceful restart. How can this be solved?
5) Are there any other things to know, or important materials to read, that I should be familiar with before starting to code this project?
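Regarding point 4, the slave-side check I have in mind is roughly the following. This is only a sketch under assumptions: `check_master` and `on_master_lost` are hypothetical callables (for example, a request to the master's web interface and a command that shuts the machine down), and the thresholds are arbitrary.

```python
import time
from typing import Callable


def watch_master(check_master: Callable[[], bool],
                 on_master_lost: Callable[[], None],
                 max_misses: int = 3,
                 interval: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll the master and give up after max_misses consecutive failed checks."""
    misses = 0
    while True:
        # A successful check resets the counter, so one lost ping is tolerated.
        misses = 0 if check_master() else misses + 1
        if misses >= max_misses:
            on_master_lost()  # hypothetical hook: e.g. shut this slave down
            return
        sleep(interval)
```

Requiring several consecutive misses is meant to keep a single network hiccup from taking the whole cluster down.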
Thank you in advance!