
What I am finding with Ray is that the autoscaling documentation is lacking, and that the config is easily broken without a clear reason why.

I first wanted to pull a Docker image to the cluster and run it there, but the way Ray handles Docker images is very different from pulling an image and running it on a remote machine, even when the Dockerfile is based on the rayproject base image. I gave up on this approach.
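(For context, my understanding is that the Docker route is configured through a docker section in the cluster YAML rather than by pulling and running the image yourself on the nodes; the sketch below is roughly what I was attempting, with the image and container name only as examples, not a working setup.)

docker:
    image: "rayproject/ray:latest"
    container_name: "ray_container"
    pull_before_run: True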

Therefore I am trying an alternative solution, which is to pull my pipeline from git, install my dependencies in a conda env, and then submit my script.py job to the cluster (roughly the workflow sketched below).
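In other words, the workflow I am after is roughly the following, where cluster.yaml and script.py stand in for my actual config and pipeline entry point:

ray up cluster.yaml                  # launch the head node and workers, run the setup commands
ray submit cluster.yaml script.py    # copy the script to the head node and run it there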

The only autoscaling example I can get working is the minimal_cluster.yaml config below, where the workers do show up as launched on AWS. But this is not useful in itself, because I need to install a multitude of dependencies on the cluster to run more complex scripts.

cluster_name: minimal

initial_workers: 3
min_workers: 3
max_workers: 3

provider:
    type: aws
    region: eu-west-2

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.2xlarge
    ImageId: latest_dlami  # Default Ubuntu 16.04 AMI.

worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: latest_dlami  # Default Ubuntu 16.04 AMI.

As soon as I attempt to add complexity, the default setup gets overridden to manual and no workers are initialized, despite the terminal saying the Ray cluster has launched. (Running a Python script does not initiate workers either.)

What I want is to launch a cluster, create a conda env, install my dependencies into that env, and run a Python script over the whole cluster, with workers actually shown as initialized on my AWS EC2 dashboard.

For example, something like this:

cluster_name: ray_cluster

min_workers: 8
max_workers: 8

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-2
    # availability_zone: us-west-2b

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-07c1207a9d40bc3bd  # Default Ubuntu 16.04 AMI.

    # Set primary volume to 50 GiB
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 50

worker_nodes:
    InstanceType: c4.2xlarge
    ImageId: ami-07c1207a9d40bc3bd  # Default Ubuntu 16.04 AMI.

    # Set primary volume to 50 GiB
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 50



# List of shell commands to run to set up nodes.
setup_commands:
    # Consider uncommenting these if you run into dpkg locking issues
    # - sudo pkill -9 apt-get || true
    # - sudo pkill -9 dpkg || true
    # - sudo dpkg --configure -a
    # Install basics.
    - sudo apt-get update
    - sudo apt-get install -y build-essential
    - sudo apt-get install curl
    - sudo apt-get install unzip
    # Install Node.js in order to build the dashboard.
    - curl -sL https://deb.nodesource.com/setup_12.x | sudo -E bash
    - sudo apt-get install -y nodejs
    # Install Anaconda.
    - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
    - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
    - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
    # Build  env

    - git clone pipline
    
    - conda create --name ray_env
    - conda activate ray_env
    - conda install --name ray_env pip
    - pip install --upgrade pip
    - pip install ray[all]
    - conda env update -n ray_env --file conda_env.yaml
    - conda install xgboost

# Custom commands that will be run on the head node after common setup.
head_setup_commands: 
        - conda activate ray_env

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: 
        - conda activate ray_env
        
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379        
        
        
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

The script I'm trying to run is this:

import os
import ray
import time
import sklearn
import xgboost
from xgboost.sklearn import XGBClassifier



def printer():
    print("INSIDE WORKER " + str(time.time()) +"  PID  :    "+  str(os.getpid()))


# decorators allow for futures to be created for parallelization
@ray.remote        
def func_1():
    model = XGBClassifier()
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count
        
        
@ray.remote        
def func_2():
    #model = XGBClassifier()
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count

    
@ray.remote
def func_3():
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count

def main():
    model = XGBClassifier()

    start = time.time()
    results = []
    
    ray.init(address='auto')
    # append function futures
    for i in range(1000):
        results.append(func_1.remote())
        results.append(func_2.remote())
        results.append(func_3.remote())
        
    # run in parallel and get aggregated list
    a = ray.get(results)
    b = 0
    
    #add all values in list together
    for j in range(len(a)):
        b += a[j]
    print(b)
    
    #time to complete
    end = time.time()
    print(end - start)
    
    
if __name__ == '__main__':
    main()

and I am submitting it with:

ray submit cluster_minimal.yml ray_test.py -start -- --ray-address='xx.31.xx.xx:6379'

Any help, or any way someone can show me how to do this, and I would be eternally grateful. A simple template that runs would be incredibly useful, as nothing I try works. If not, I might have to move to PySpark or something similar, which would be a shame, as the way Ray uses decorators and actors is a very nice way of doing things.


2 Answers


Thanks for asking this question. Your feedback is very important to us. Next time you have an issue, please file it on our GitHub repo (https://github.com/ray-project/ray/issues/new/choose) with reproduction code and the output you saw, so we can keep track of it and it doesn't get lost. We would also love to improve the autoscaler documentation. Can you please provide more information on what you would like to know and how we can improve it?

For your question, I copy-pasted your files and ran exactly what you ran with the latest Ray nightly (https://docs.ray.io/en/master/installation.html#daily-releases-nightlies). The only difference was that I ran "ray submit cluster_minimal.yml ray_test.py --start" (without the ray address and with two dashes for --start; I am not sure what you mean by providing a ray-address before the cluster has even launched).

Ray is printing a clear error:

        (9/18) git clone pipline
    fatal: repository 'pipline' does not exist
    Shared connection to 3.5.zz.yy closed.
      New status: update-failed
      !!!
      SSH command failed.
      !!!
      
      Failed to setup head node.

You seem to be calling git clone pipline, but I am not sure what you expect this to do. Can you try using the latest Ray nightly and post here what output you are getting and which Ray version you are using?
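Presumably that setup command needs the full URL of the repository you want to clone, along the lines of the placeholder below:

    - git clone https://github.com/<your-user>/<your-pipeline-repo>.git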

  • Thank you Ameer, I have posted an answer below, but it only partially answers the question. Do you know why it does not launch workers? – jtm101 Jan 11 '21 at 13:26

I have run the following, and installing xgboost into the conda env on the cluster was successful with the setup below, as import xgboost is found on the cluster when running ray_test.py as shown below (it has been altered). I ran it as Ameer suggested, with the following command, on ray 1.0.0 and Python 3.6.12 :: Anaconda, Inc. This therefore answers that part of the question.

What it does not do is launch the workers under AWS EC2 → Instances; there are no workers to be found. Can someone please advise why it launches only the head node and no workers?

$ ray submit cluster.yaml ray_test.py --start

This is the updated config:

cluster_name: ray_cluster

min_workers: 3
max_workers: 3

# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-2
    # availability_zone: us-west-2b

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.2xlarge
    ImageId: latest_dlami  # Default Ubuntu 16.04 AMI.

worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: latest_dlami  # Default Ubuntu 16.04 AMI.


# List of shell commands to run to set up nodes.
setup_commands:
    # Consider uncommenting these if you run into dpkg locking issues
#    - sudo pkill -9 apt-get || true
#    - sudo pkill -9 dpkg || true
#    - sudo dpkg --configure -a
    # Install basics.
    - sudo apt-get update
    - sudo apt-get install -y build-essential
    - sudo apt-get install curl
    - sudo apt-get install unzip
    # Install Node.js in order to build the dashboard.
    - curl -sL https://deb.nodesource.com/setup_12.x | sudo -E bash
    - sudo apt-get install -y nodejs
    # Install Anaconda.
    - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
    - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
    - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
    # Build  env

    - conda create --name ray_env
    - conda activate ray_env
    - conda install --name ray_env pip
    - pip install --upgrade pip
    - pip install ray[all]
    - conda install -c conda-forge xgboost


# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - source activate ray_env

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
    - source activate ray_env

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379


# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

This is the updated ray_test.py:

# These imports succeed with the config and conda env above
import os
import ray
import time
import xgboost




def printer():
    print("INSIDE WORKER " + str(time.time()) +"  PID  :    "+  str(os.getpid()))


# decorators allow for futures to be created for parallelization
@ray.remote        
def func_1():
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count
        
        
@ray.remote        
def func_2():
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count

    
@ray.remote
def func_3():
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count

def main():
    start = time.time()
    results = []
    
    ray.init(address='auto')
    # append function futures
    for i in range(1000):
        results.append(func_1.remote())
        results.append(func_2.remote())
        results.append(func_3.remote())
        
    # run in parallel and get aggregated list
    a = ray.get(results)
    b = 0
    
    #add all values in list together
    for j in range(len(a)):
        b += a[j]
    print(b)
    
    #time to complete
    end = time.time()
    print(end - start)
    
    
if __name__ == '__main__':
    main()