
I am using the Ray module to launch an Ubuntu (16.04) cluster on AWS EC2. In the configuration I specified min_workers, max_workers and initial_workers as 2, because I do not need any auto-sizing. I also want a t2.micro head node and c4.8xlarge workers. The cluster launches, but only the head node comes up. The terminal output below runs from the Ray installation onwards, with details elided as ....; a sketch for checking whether any workers have registered follows the output:-

2019-04-18 14:52:48,462 INFO updater.py:268 -- NodeUpdater: Running pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Collecting ray==0.7.0.dev2 from https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
Downloading https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl (56.2MB)
.....
.....
Successfully built pyyaml
Installing collected packages: click, colorama, six, redis, typing, filelock, flatbuffers, numpy, pyyaml, more-itertools, setuptools, attrs, atomicwrites, pluggy, py, pathlib2, pytest, funcsigs, ray
Successfully installed atomicwrites attrs click colorama filelock flatbuffers funcsigs more-itertools numpy pathlib2 pluggy py pytest pyyaml-3.11 ray redis setuptools-20.7.0 six-1.10.0 typing
You are using pip version 8.1.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2019-04-18 14:53:32,656 INFO updater.py:268 -- NodeUpdater: Running pip3    install boto3==1.4.8 on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Collecting boto3==1.4.8
Downloading https://files.pythonhosted.org/packages/7d/09/66fef826fb13a2cee74a1df56c269d2794a90ece49c3b77113b733e4b91d/boto3-1.4.8-
....
....
Installing collected packages: docutils, jmespath, six, python-dateutil, botocore, s3transfer, boto3
Successfully installed boto3-1.4.8 botocore-1.8.50 docutils-0.14 jmespath-0.9.4 python-dateutil-2.8.0 s3transfer-0.1.13 six-1.12.0
You are using pip version 8.1.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2019-04-18 14:53:37,805 INFO updater.py:268 -- NodeUpdater: Running ray stop on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
WARNING: Not monitoring node memory since `psutil` is not installed.  Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-04-18 14:53:39,775 INFO updater.py:268 -- NodeUpdater: Running ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-04-18 18:53:40,167 INFO scripts.py:288 -- Using IP address 172.31.7.117 for this node.
2019-04-18 18:53:40,167 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-18_18-53-40_7981/logs.
2019-04-18 18:53:40,271 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:6379 to respond...
2019-04-18 18:53:40,389 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:60491 to respond...
2019-04-18 18:53:40,390 INFO services.py:804 -- Starting Redis shard with 0.21 GB max memory.
2019-04-18 18:53:40,400 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-18_18-53-40_7981/logs.
2019-04-18 18:53:40,410 INFO services.py:1439 -- Starting the Plasma object store with 0.31 GB memory using /dev/shm.
2019-04-18 18:53:40,421 WARNING services.py:907 -- Failed to start the reporter. The reporter requires 'pip install psutil'.
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-04-18 18:53:40,425 INFO scripts.py:319 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address 172.31.7.117:6379

from the node you wish to add. You can connect a driver to the cluster from Python by running

import ray
ray.init(redis_address="172.31.7.117:6379")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

ray stop
2019-04-18 14:53:40,593 INFO log_timer.py:21 -- NodeUpdater: i-064f62badf69f8cee: Setup commands completed [LogTimer=115941ms]
2019-04-18 14:53:40,593 INFO log_timer.py:21 -- NodeUpdater: i-064f62badf69f8cee: Applied config 248f16e493ac5bcd753a673eb7202fa2b49e0f9f  [LogTimer=173814ms]
2019-04-18 14:53:40,973 INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=up-to-date on ['i-064f62badf69f8cee'] [LogTimer=374ms]
2019-04-18 14:53:41,069 INFO commands.py:264 -- get_or_create_head_node:  Head node up-to-date, IP address is: 54.226.178.23
To monitor auto-scaling activity, you can run:

  ray exec ray_config.yaml  'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'

To open a console on the cluster:

  ray attach ray_config.yaml

To ssh manually to the cluster, run:

  ssh -i /home/haines/.ssh/ray-autoscaler_us-east-1.pem ubuntu@54.226.178.23

2019-04-18 14:53:41,181 INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-runtime-config=248f16e493ac5bcd753a673eb7202fa2b49e0f9f on ['i-064f62badf69f8cee'] 
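Not part of the original post, but a minimal sketch of how one might check from a driver whether any workers have actually registered, using the Redis address printed in the log above (the exact call differs slightly across Ray versions):

import ray

# Connect a driver to the cluster the autoscaler just started
# (the Redis address is printed by `ray start --head`, see the log above).
ray.init(redis_address="172.31.7.117:6379")

# Totals across all registered nodes: with only the t2.micro head node this is
# roughly 1 CPU; each c4.8xlarge worker should add 36 CPUs once it joins.
# (On the 0.7-era wheel this may instead be ray.global_state.cluster_resources().)
print(ray.cluster_resources())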

I used the standard configuration (example-full.yaml) with the following changes:-

min_workers: 2

initial_workers: 2

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a,us-east-1b

head_node:
    InstanceType: t2.micro
    ImageId: ami-0565af6e282977273 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20190212

worker_nodes:
    InstanceType: c4.8xlarge
    ImageId: ami-0f9cf087c1f27d9b1 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20181114
        #MarketType: spot

setup_commands:
    - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
    - sudo apt-get update
    - sudo apt-get install python3-pip
    - pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
    - pip3 install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

Latest setup that still fails to launch workers (it runs without errors, but only the head node comes up):-

setup_commands:
- sudo apt-get update
- wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh 1>/dev/null || true
- bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 1>/dev/null || true
- echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
- sudo pkill -9 apt-get || true
- sudo pkill -9 dpkg || true
- sudo dpkg --configure -a
- sudo apt-get install python3-pip || true
- pip3 install --upgrade pip
- pip3 install --user psutil
- pip3 install --user proctitle
- pip3 install --user ray
- pip3 install --user boto3==1.4.8
- pip3 install --user https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
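The wheel in the last command is tagged cp35-cp35m, which only installs into CPython 3.5 (a cp36 wheel needs 3.6). As an illustrative aside, not part of the original setup, a quick way to see which interpreter is in play and which Python the pip3 on the PATH installs into:

import platform
import subprocess
import sys

# Which interpreter is running this script? A cp35 manylinux1 wheel installs only
# into CPython 3.5 on x86_64; a cp36 wheel needs CPython 3.6.
print("interpreter:", sys.version.split()[0], platform.machine())

# Which Python does the `pip3` on the PATH belong to? pip reports it in --version,
# e.g. "pip 8.1.1 from /usr/lib/python3/dist-packages (python 3.5)".
print(subprocess.check_output(["pip3", "--version"]).decode().strip())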
Nick Mint
  • Can you share the full `ray_conf.yaml` file? Do the default config files that ship with Ray work for you? E.g., https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml. – Robert Nishihara Apr 13 '19 at 06:46
  • Yes, that is the file that I used, with a few slight modifications (see addition). Boto didn't seem to find the instances used in the original:- botocore.exceptions.ClientError: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-0b294f219d14e6a82]' does not exist. Maybe that is because I changed the region to the one that I usually use. The key pair that I substituted was one that Ray created when I first launched. Also I disabled the spot option. – Nick Mint Apr 13 '19 at 17:27
  • Sounds like it can't find the AMI. If you changed the region, then you will also need to change the AMI. Note that you appear to be using different AMIs for the head node and the worker node. Is that intentional? – Robert Nishihara Apr 15 '19 at 23:28
  • I changed the AMIs to ones that I know work in the us-east-1 region. If I change the head node to c4.8xlarge then that also gets launched, but, again, no workers. Do you recognize the error message? It seems to me that it is probably unrelated and has to do with something like updating boto3. Perhaps Ray only spins up workers once a job is submitted? – Nick Mint Apr 16 '19 at 19:36
  • I'm not sure, but the line saying that the command `'['ssh', '-i', '/home/haines/.ssh/ray-autoscaler_us-east-1.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_sockets/%C', '-o', 'ControlPersist=5m', 'ubuntu@54.89.150.50', "bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && mkdir -p ~'"]'` failed is worth looking into. Can you try running that command separately? Or `ssh`ing to the machine and running the command and seeing if it fails? – Robert Nishihara Apr 16 '19 at 21:02
  • The worker machines should start launching as soon as the head node is successfully set up, but I don't think the head node is getting successfully set up. Btw, did the default autoscaler config file work for you and only start failing after the modifications? If so, I suspect that some command is failing and not propagating the error properly. – Robert Nishihara Apr 16 '19 at 21:03
  • The original config produces an error on the "autoscaling_mode". I removed this, and it worked OK and appears to start a cluster. I cannot check the instances in my EC2 console, because it defaults to US East. The original error appears to happen when these commands are executed: bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl'. Do you think that the relevant installation is inaccessible from nodes in the East? – Nick Mint Apr 17 '19 at 01:18
  • The Python wheel should be accessible from anywhere. Can you try manually `ssh`ing to the machine? E.g., since you should have the IP address and the key used by the autoscaler. Even if one of the autoscaler commands fails you can still ssh to the machine and run the `pip install` command to see what is going wrong. Also, in the EC2 console, you can change the region by clicking on the region name in the upper right and selecting a different region. – Robert Nishihara Apr 17 '19 at 05:25
  • I tried that with the ssh command exactly as shown in the error comment (except for the changed public IP). Then I entered the bash commands: see the attached terminal responses above. The shell doesn't seem to have any access to installation resources. – Nick Mint Apr 17 '19 at 22:43
  • Things seem to start going wrong at the line:- bash: cannot set terminal process group (-1): Inappropriate ioctl for device bash: no job control in this shell – Nick Mint Apr 18 '19 at 13:35
  • If I ssh in and then run the export PATH=etc., the command works, but the problem with pip is still there. – Nick Mint Apr 18 '19 at 14:18
  • It seems that that bash complaint is a red herring and can be ignored. I have managed to get pip installed by first running "sudo apt-get update". So now the error is "ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl is not a supported wheel on this platform.". This occurs on both c4.8xlarge and t2.micro instances, both running Ubuntu 16.04, which is what I use on my local machine and on which I have used Ray - so I am baffled. – Nick Mint Apr 18 '19 at 17:19
  • Is the Python version correct? That is, are you using Python 3.6? What about trying just `pip install ray` instead? – Robert Nishihara Apr 18 '19 at 18:46
  • If that fails as well, then it's possible that the command `pip install ...` is not finding the right version of pip? E.g., if you install `pip` with `sudo apt-get install pip`, then it is probably a different version. In general, things work better for me with Anaconda Python (but if you install Anaconda you also have to make sure to add it to your `PATH`). – Robert Nishihara Apr 18 '19 at 18:49
  • Thanks for staying with this problem Robert; you have been very helpful. You are right, there were version conflicts. I have replaced the code above with the latest state of play. It seems that there were two problems: the default python version was 3.5.2 (the one that I am using locally), but the pip installation reverted this to 2, because I didn't specify python3 as the version in the pip install. Moreover, the ray source was cp36; I guessed this was meant for 3.6, so I tried cp35, and it worked. However, despite no errors, I still only see the head node launching. Any ideas? – Nick Mint Apr 18 '19 at 20:54
  • I just checked the original config (except for making min and initial workers both 2) on us-west-2, and the result looks pretty much the same, i.e. only a head-node launched. – Nick Mint Apr 18 '19 at 22:11
  • Probably not the issue, but what is your instance limit for `c4.8xlarge` on AWS in the relevant region? – Robert Nishihara Apr 18 '19 at 22:24
  • Probably not the issue, but the line `echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc` only makes sense on the deep learning AMI and probably not on the AMI that you're using. – Robert Nishihara Apr 18 '19 at 22:25
  • No, my limit is 1024. That line just appends the PATH, so it shouldn't have any effect if it is not needed. Do you know what the purpose of the Docker block is; why would one need to containerize? – Nick Mint Apr 19 '19 at 01:00
  • The docker block is optional and is relevant if you want to start the Ray processes inside of a docker container inside of the VM (e.g., because you have a specific docker image you want to use). – Robert Nishihara Apr 19 '19 at 07:13
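Aside on the InvalidAMIID.NotFound error discussed in the comments above: a rough sketch (using boto3, which the setup commands already install, and the worker AMI from the config) of how to confirm an ImageId is visible in the configured region:

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

try:
    # Describe the worker AMI from the config above; this raises
    # InvalidAMIID.NotFound if the image is not available in this region.
    image = ec2.describe_images(ImageIds=["ami-0f9cf087c1f27d9b1"])["Images"][0]
    print("Found:", image["Name"])
except ClientError as err:
    print("AMI not usable in us-east-1:", err)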

2 Answers


I ran a slightly modified version of the config you posted, and this works for me:

cluster_name: test

min_workers: 2

initial_workers: 2

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a,us-east-1b

head_node:
    InstanceType: t2.micro
    ImageId: ami-0565af6e282977273 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20190212

worker_nodes:
    InstanceType: c4.8xlarge
    ImageId: ami-0f9cf087c1f27d9b1 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20181114
        #MarketType: spot

setup_commands:
    - sudo apt-get update
    # Install Anaconda.
    - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
    - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
    - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
    # Install Ray.
    - pip install ray
    - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

The only real difference, I think, is installing Anaconda Python and putting it on the PATH so that pip finds it properly. I suspect the issue was related to not finding the right version of Python.
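As an aside (not part of the original answer), once workers do come up, a rough way to confirm that tasks actually land on them is to fan out a trivial remote function and collect hostnames, again using the Redis address printed by the head node:

import ray

# Redis address printed by `ray start --head` on the head node (see the question's log).
ray.init(redis_address="172.31.7.117:6379")

@ray.remote
def node_hostname():
    import socket
    import time
    time.sleep(0.1)  # slow each task slightly so work spreads across nodes
    return socket.gethostname()

# With two c4.8xlarge workers attached, more than one hostname should appear.
print(set(ray.get([node_hostname.remote() for _ in range(200)])))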

Robert Nishihara
  • Thanks Robert, that worked for me as well, though I am baffled as to why. I have added the latest state of the setup section that fails (above) including the Anaconda installation that you have. I was driven to adding all of the rest (mainly ensuring the right version of pip3) in order to get that ray-wheels command to execute without errors; yet you seem to have omitted this altogether. – Nick Mint Apr 23 '19 at 15:49
  • After a process of elimination, I find that using the file_mounts section is what was stopping my config from starting workers. I was trying: file_mounts: {"./data": "./data","./": "./test_small.py"} in exactly the format of the original .yaml example. This succeeds in transferring the data directory and test_small.py to the head node, but no workers are started unless I leave the list empty. Any ideas? – Nick Mint Apr 23 '19 at 17:42
  • See (https://stackoverflow.com/questions/56370163/ray-cluster-configuration-file-mounts-section-not-allowing-worker-nodes-to-launch) for the solution to the file_mounts problem. – Nick Mint May 31 '19 at 17:40
"putting it on the PATH so that pip finds it properly. I suspect the issue was related to not finding the right version of Python."

Probably the default Python did not have the Ray cluster launcher. I was baffled too about why my Ray didn't spawn workers, and it turned out I only had ray[tune] in my pip dependencies, which doesn't include the cluster launcher. Adding ray[default] in addition to ray[tune] solved it.

From the Ray installation docs:

# Install Ray with support for the dashboard + cluster launcher
pip install -U "ray[default]"
Alexey E