
I'm having difficulty getting ipcluster to start all of the ipengines that I ask for. It appears to be some sort of timeout issue. I'm using IPython 2.0 on a Linux cluster with 192 processors. I run a local ipcontroller and start ipengines on my 12 nodes using SSH. It's not a configuration problem (at least I don't think it is), because I have no trouble running about 110 ipengines. When I try for a larger number, some of them seem to die during startup - and I do mean some of them: the final count varies a little from run to run. ipcluster reports that all engines have started. The only sign of trouble I can find (other than not having use of all of the requested engines) is the following in some of the ipengine logs:

2014-06-20 16:42:13.302 [IPEngineApp] Loading url_file u'.ipython/profile_ssh/security/ipcontroller-engine.json'
2014-06-20 16:42:13.335 [IPEngineApp] Registering with controller at tcp://10.1.0.253:55576
2014-06-20 16:42:13.429 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2014-06-20 16:42:13.434 [IPEngineApp] Using existing profile dir: u'.ipython/profile_ssh'
2014-06-20 16:42:13.436 [IPEngineApp] Completed registration with id 49
2014-06-20 16:42:25.472 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 18:09:12.782 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 19:14:22.760 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 20:00:34.969 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).

I did some googling to see if I could find some wisdom, and the only thing I've come across is http://permalink.gmane.org/gmane.comp.python.ipython.devel/12228. The author seems to think it's a timeout of sorts.

I also tried tripling the IPClusterStart.early_shutdown and IPClusterEngines.early_shutdown times (90 seconds instead of the default 30), without any luck.
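For reference, the timeouts mentioned above can be set in the profile's ipcluster_config.py. This is a sketch; the option names below are taken from IPython 2.x's parallel configuration, so verify them against `ipcluster start --help-all` for your version before relying on them:

```python
# ipcluster_config.py in the profile directory (here, profile_ssh).
c = get_config()

# Give slow engines longer before ipcluster gives up on them
# (seconds; the default is 30).
c.IPClusterStart.early_shutdown = 90
c.IPClusterEngines.early_shutdown = 90

# Let each engine wait longer for the controller to answer its
# registration request (seconds).
c.EngineFactory.timeout = 30

# Slow the heartbeat so a briefly overloaded controller does not
# mark engines as dead (milliseconds; the default is 3000).
c.HeartMonitor.period = 10000
```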

Thanks - in advance - for any pointers on getting the full use of my cluster.

DailRowe

1 Answer


When I try to execute ipcluster start --n=200 I get: OSError: [Errno 24] Too many open files
This could be what is happening to you too. Try raising the OS's open-file limit.
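To check whether the descriptor limit is the culprit, the process-side limits can be inspected - and the soft limit raised up to the hard limit - from Python itself, using the standard resource module (a sketch; a persistent, system-wide change usually belongs in /etc/security/limits.conf or the shell's `ulimit -n` instead):

```python
import resource

# Current per-process open-file limits: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))

# A non-root process may raise its own soft limit, but only up to
# the hard limit.
if hard != resource.RLIM_INFINITY and soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```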

Ivelin
    ulimit -n reports that I have a limit of 1024 files, so I don't think that is the problem. Please let me know if this isn't the correct diagnostic – DailRowe Jul 09 '14 at 17:48
  • Also... I can run two different ipclusters - both with 96 threads - concurrently. I assume this would violate the open-files limit if that were the problem. – DailRowe Jul 09 '14 at 17:53