
I am running code that works perfectly on the cluster. When I increase the number of cores to 3844, I get the following error:

"too many retries sending message to 0x0040:0x00152080, giving up"

Is this error a network problem, or is it related to the code?

Unfortunately, I cannot post the entire code here, as it is pretty big.

Thanks

JimBamFeng
  • Do you have access to sufficient cores? I assume that you're using qsub to send the job to the cluster. Are there limits on the number of cores/machines that you can ask for? – Eric Yang Aug 01 '18 at 14:46
  • Yes, the limit is much higher than what I am requesting. Slurm rejects submissions that are above the allowed limit. – JimBamFeng Aug 01 '18 at 14:53
  • Are you the administrator of this cluster? If you are not, you probably want to talk to them. If you are, I would recommend asking on https://serverfault.com/ with much more information about your installation. Stack Overflow is about programming/code, and this does not appear to be a programming issue. – Zulan Aug 01 '18 at 20:50
  • Thanks for the info, I was not aware of serverfault.com. – JimBamFeng Aug 02 '18 at 14:20

0 Answers