
With the advent of Docker and scheduling & orchestration services like Amazon's ECS, I'm trying to determine the optimal way to deploy my Node API. Docker and ECS aside, I've wanted to take advantage of the Node cluster library to gracefully handle crashing the Node app in the event of an asynchronous error, as suggested in the documentation, by creating a master process and multiple worker processes.

One of the benefits of the cluster approach, besides gracefully handling errors, is creating a worker process for each available CPU. But does this make sense in the Docker world? Would it make sense to have multiple Node processes running in a single Docker container that was going to be scaled into a cluster of EC2 instances on ECS?

Without the Node cluster approach, I'd lose the ability to gracefully handle errors, so I think that at a minimum I should run a master and one worker process per Docker container. I'm still confused as to how many CPUs to define in the Task Definition for ECS. The ECS documentation says something about each container instance having 1024 units per CPU, but that isn't the same thing as EC2 Compute Units, is it? And with that said, I'd need to pick EC2 instance types with the appropriate number of vCPUs to achieve this, right?

I understand that achieving the optimal configuration may require some benchmarking of my specific Node API application, but it would be awesome to have a better idea of where to start. Maybe there is some studying/research I need to do? Any pointers to guide me on the path or recommendations would be most appreciated!

Edit: To recap my specific questions:

  1. Does it make sense to run a master/worker cluster as described here inside a Docker container to achieve graceful crashing?

  2. Would it make sense to use nearly identical code to that in the Cluster docs, 'scaling' to the available CPUs via require('os').cpus().length? (See the sketch after this list.)

  3. What does Amazon mean in the ECS Task Definition documentation where it says, for the cpu setting, that a container instance has 1024 units per CPU? And what would be a good starting point for this setting?

  4. What would be a good starting point for the instance type to use for an ECS cluster aimed at serving a Node API based on the above? And how do the available vCPUs affect the previous questions?
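
To make questions 1 and 2 concrete, here's roughly the pattern I have in mind, adapted from the example in the Cluster docs; the port and the shutdown timeout are just illustrative:

```js
var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Fork one worker per available CPU.
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
  // Replace any worker that dies so capacity stays constant.
  cluster.on('exit', function (worker) {
    console.error('worker ' + worker.process.pid + ' died, forking a replacement');
    cluster.fork();
  });
} else {
  var server = http.createServer(function (req, res) {
    res.end('hello\n');
  });
  server.listen(8000);

  // Graceful crash: stop accepting requests, record the error,
  // then end the process so the master can fork a replacement.
  process.on('uncaughtException', function (err) {
    console.error(err.stack);
    server.close(function () {
      process.exit(1);
    });
    // Failsafe in case in-flight connections keep close() from completing.
    setTimeout(function () {
      process.exit(1);
    }, 5000).unref();
  });
}
```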

Aaron Storck

3 Answers


All these technologies are new and best practices are still being established, so consider these to be tips from my experience only.

One-process-per-container is more of a suggestion than a hard and fast rule. It's fine to run multiple processes in a container when you have a use for it, especially in this case where a master process forks workers. Just use a single container and allow it to fork one process per core, as you've suggested in the question.

On EC2, instance types have a number of vCPUs, each of which appears as a core to the OS. For the ECS cluster, use an EC2 instance type such as the c3.xlarge, which has four vCPUs. In ECS this translates to 4096 CPU units. If you want the app to make use of all 4 vCPUs, create a task definition that requires 4096 CPU units.
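
Roughly, the relevant part of such a task definition would look like the following; the family, names, image, ports, and memory value are placeholders:

```json
{
  "family": "node-api",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "example/node-api:latest",
      "cpu": 4096,
      "memory": 2048,
      "essential": true,
      "portMappings": [
        { "containerPort": 8000, "hostPort": 8000 }
      ]
    }
  ]
}
```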

But if you're doing all this only to stop the app from crashing, you could instead use a restart policy to restart the container if it crashes. It appears that restart policies are not yet supported by ECS, though.
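
For reference, outside of ECS a restart policy is just a flag on plain docker run; the image name here is a placeholder:

```sh
# Restart the container automatically on a non-zero exit,
# giving up after 5 attempts
docker run -d --restart=on-failure:5 example/node-api:latest
```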

Ben Whaley
  • Thanks for the response, definitely helpful. So one Docker container per EC2 instance? That was definitely one of the ways that crossed my mind; it keeps things simpler. I just wasn't sure if there was an advantage to stacking Docker containers on an EC2 instance, but I guess if I'm forking for each vCPU at the app level (inside the container), stacking containers doesn't make much sense. With regard to why I'm doing this, it's actually not about stopping the app from crashing, it's about crashing gracefully (stop listening for HTTP requests, record the error, and then end the process). – Aaron Storck Jan 02 '15 at 05:49
  • It's also useful to have the master process periodically kill the workers as a fail-safe against memory leaks. – Aaron Storck Jan 02 '15 at 05:51
  • Glad it helped. I understand the fail safe goal, but it does seem like potentially hiding bugs in the app through an autorestart mechanism should be a last resort. And sure, running only one container per instance is fine if you're utilizing all the cores. – Ben Whaley Jan 02 '15 at 13:31
  • Totally agree. It is definitely a last resort. The good news is that there are some wonderful tools in the Node ecosystem to find and address memory leaks, and a major effort will definitely be made to do so during development and staging. However, in production, I'd rather memory leaks not creep up unexpectedly. No harm in restarting the process when it's idle. – Aaron Storck Jan 02 '15 at 16:05

That seems like a really good pattern. It's similar to what is done with Erlang/OTP, and I don't think anyone would argue that it isn't one of the most robust systems on the planet. Now the question is how to implement it.

I would leverage patterns from Heroku or other similar PaaS systems that have a bit more maturity. I'm not saying that Amazon is the wrong place to do this, just that a lot of work has been done on this elsewhere that you can translate. For instance, this article has a recipe for it: https://devcenter.heroku.com/articles/node-cluster

As far as the relationship between vCPUs and Compute Units goes, it looks like it's just a straight 1:1024 ratio. It's a move toward micro-charging based on CPU utilization, and they're taking that even further with the Lambda work, where they charge you based on the fractions of a second you utilize.

Jason Mcmunn
  • Thanks for the response. I agree Heroku and others have developed some pretty awesome systems. And in the Docker-sphere there is even a project called [Deis](http://deis.io/overview/), a Heroku-inspired open source PaaS built on Docker and CoreOS. That said, I'd like to try to move forward with AWS' new container service, ECS. Also, in AWS EC2, Compute Units are a metric Amazon came up with that "provides the relative measure of the integer processing power of an Amazon EC2 instance." This is not the same as the cpu setting on the ECS Task Definition. – Aaron Storck Dec 26 '14 at 01:18
  • I guess I'm still a little confused as to the ideal number of processes to run on an EC2 instance. Maybe I need to do more research on how the hypervisor works, but if I had a Docker container with 2 processes in it, I'm trying to figure out how many containers to run per EC2 instance; that depends on the EC2 instance type, and the ideal setup depends on the configuration of the Task Definition. These are the things I'm trying to understand. – Aaron Storck Dec 26 '14 at 01:27

In the Docker world you would run one Node.js process per Docker container, but you would run many such containers on each of your EC2 instances. If you use something like fig you can use fig scale <n> to run many redundant containers on an instance. This way you don't have to define your Node.js process count ahead of time, and each of your Node.js processes is isolated from the others.
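
For illustration, a fig.yml along these lines would do it; the service name and image are placeholders, and leaving the host port unspecified lets Docker assign a free one to each copy so they don't collide:

```yaml
# fig.yml -- one Node.js process per container
api:
  image: example/node-api:latest
  ports:
    - "8000"
```

Then `fig up -d` starts the service and `fig scale api=4` runs four redundant copies on the host.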

Usman Ismail
  • Yeah, I wouldn't be using anything like fig in production; as I mentioned in my question, my plan is to use Amazon's Container Service, which handles orchestration & scheduling. That said, even in a fig environment, without multiple processes running at the application level you can't gracefully capture and handle errors as described in the Cluster module documentation I referenced. It's also not uncommon to run multiple processes inside an individual container; the documentation and many solutions suggest such a pattern. – Aaron Storck Dec 21 '14 at 01:33
  • All things said, I'm still looking for answers with regard to the choices of cpu units on the Task Definition for ECS, and with EC2, the vCPUs and Compute Units. – Aaron Storck Dec 21 '14 at 01:38