I have a task running on an EC2 cluster that slows down progressively as the virtual CPUs come into use (regardless of EBS volume size). To avoid this, I want to disable hyperthreading on all nodes and have been trying to follow the advice given here: https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/
I am using Ray to launch the cluster on Ubuntu 18.04, and assumed that the initialization_commands section of the config.yaml file is the appropriate place for the bash commands (the bootcmd: heading is not recognised there). I have tried a number of different formats, but none seems to work. For example:
# List of commands run before setup_commands.
initialization_commands:
- for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done
A variant of this command (the exact form is visible in the ssh line of the log below) produces this error:
bash: syntax error near unexpected token `sudo'
2020-07-26 22:53:04,949 INFO log_timer.py:17 -- NodeUpdater: i-0eefc0511ce029fb3: Initialization commands completed [LogTimer=139ms]
2020-07-26 22:53:04,949 INFO log_timer.py:17 -- NodeUpdater: i-0eefc0511ce029fb3: Applied config 39910e8bc12541ca5e316063231a2493642efee4 [LogTimer=60603ms]
2020-07-26 22:53:04,950 ERROR updater.py:348 -- NodeUpdater: i-0eefc0511ce029fb3: Error updating (Exit Status 1) ssh -i /home/haines/.ssh/ray-key2_us-east-1.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_98734ce2b6/5f5c61af53/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ubuntu@3.93.77.73 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr '"'"','"'"' '"'"'\n'"'"' | sort -un); sudo echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done'
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 351, in run
    raise e
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 341, in run
    self.do_update()
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 426, in do_update
    self.cmd_runner.run(cmd)
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 263, in run
    self.process_runner.check_call(final_cmd)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/haines/.ssh/ray-key2_us-east-1.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_98734ce2b6/5f5c61af53/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', 'ubuntu@3.93.77.73', 'bash', '--login', '-c', '-i', '\'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr \'"\'"\',\'"\'"\' \'"\'"\'\\n\'"\'"\' | sort -un); sudo echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done\'']' returned non-zero exit status 1.
2020-07-26 22:53:05,018 INFO log_timer.py:17 -- AWSNodeProvider: Set tag ray-node-status=setting-up on ['i-0eefc0511ce029fb3'] [LogTimer=205ms]
2020-07-26 22:53:05,140 ERROR commands.py:285 -- get_or_create_head_node: Updating 3.93.77.73 failed
I have tried splitting the command across separate lines, and putting it in the setup_commands section instead, but neither approach works. Is there an easier way?
Update: I guess the syntax error may be due to spacing or stray characters (though I have tried many variants), but even without the loop, i.e. with only the sudo echo command writing to a single CPU, I get a permission error:
bash: /sys/devices/system/cpu/cpu50/online: Permission denied
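For reference, here is a minimal sketch of how the loop might be written, under two assumptions that have not been verified on this cluster. First, that the syntax error comes from the missing `do` keyword in the command shown in the ssh line of the log (it reads `...; sudo echo 0 > ...; done`). Second, that the permission error arises because, in `sudo echo 0 > file`, the redirection is performed by the unprivileged calling shell rather than by sudo. The function names `sibling_cpus` and `disable_sibling_cpus` and the optional directory argument are hypothetical, introduced only for illustration. The sketch also assumes comma-separated sibling lists and passwordless sudo for the login user.

```shell
# Print the IDs of the secondary hardware threads (every sibling after the
# first one listed) found under a sysfs-style CPU directory.
# Hypothetical helper; the directory argument exists only to ease testing.
sibling_cpus() {
    local cpu_root="${1:-/sys/devices/system/cpu}"
    cat "$cpu_root"/cpu*/topology/thread_siblings_list \
        | cut -s -d, -f2- | tr ',' '\n' | sort -un
}

# Take each secondary thread offline. Note the `do` after the `for` list.
# `sudo tee` opens the file as root, whereas `sudo echo 0 > file` lets the
# unprivileged shell open it, which is one way to get "Permission denied".
disable_sibling_cpus() {
    local cpu_root="${1:-/sys/devices/system/cpu}"
    local cpunum
    for cpunum in $(sibling_cpus "$cpu_root"); do
        echo 0 | sudo tee "$cpu_root/cpu$cpunum/online" > /dev/null
    done
}
```

Calling `sibling_cpus` with no arguments only lists the CPUs that would be affected, so it can be checked on a node before `disable_sibling_cpus` is run.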
Update 2: I have found what may be a simpler approach, "export OMP_NUM_THREADS=1", but it seems to have no effect when run as a bash command in the setup. I am using Ray 0.8.6, which I think is supposed to set OMP_NUM_THREADS=1, but the variable is not defined on the head node once the cluster is up and running.
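One possible explanation, offered as an assumption rather than a confirmed diagnosis: the ssh line in the log above shows the command being run inside its own `bash --login -c -i '...'` invocation, with `export OMP_NUM_THREADS=1` prepended. If every command gets a fresh shell like this, an export made in one of them would not be visible in a later, separate login. The sketch below reproduces that behaviour locally; each `bash -c` stands in for one separately executed command, and the variable names `first` and `second` are illustrative.

```shell
# `env -u` removes any inherited value so the result does not depend on
# the environment this snippet is run from.
first=$(env -u OMP_NUM_THREADS bash -c \
    'export OMP_NUM_THREADS=1; echo "${OMP_NUM_THREADS:-unset}"')

# A second, independent shell does not see the export made by the first.
second=$(env -u OMP_NUM_THREADS bash -c 'echo "${OMP_NUM_THREADS:-unset}"')

echo "same shell: $first"
echo "next shell: $second"
```

If this is what is happening, the variable would have to be set in whatever environment actually launches the workload (for example in a shell startup file), which I have not verified for Ray 0.8.6.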