What I am finding with Ray is that the documentation for autoscaling is lacking, and that the config is easily broken without a clear reason why.
I first wanted to pull a Docker image to the cluster and run it there, but the way Ray handles Docker images is very different from pulling a Docker image and running it on a remote machine, even with a rayproject base image in the Dockerfile. I gave up on this approach.
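For reference, this is roughly the kind of docker section I was adding to the cluster YAML (the image name is just a placeholder for my own image built on a rayproject base, and I may well have had these fields wrong):

docker:
    image: my_image:latest        # placeholder for my own image
    container_name: ray_container
    pull_before_run: True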
Therefore I am trying an alternative solution, which is to pull my pipeline from git, install my dependencies in a conda env, and then submit my script.py job to the cluster.
The only autoscaling example I can get working is the minimal_cluster.yaml config, where workers are shown as launched on AWS. But this is not useful in itself, because I need to install a multitude of dependencies on the cluster to run more complex scripts.
cluster_name: minimal
initial_workers: 3
min_workers: 3
max_workers: 3
provider:
    type: aws
    region: eu-west-2

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.2xlarge
    ImageId: latest_dlami  # Default Ubuntu 16.04 AMI.

worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: latest_dlami  # Default Ubuntu 16.04 AMI.
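Once ray up has finished on that config, a check along these lines (just standard Ray calls, so I assume it is a fair way to verify) shows the workers that have joined, and the instances also appear in the EC2 dashboard:

import ray

# Connect to the already-running cluster (run this on the head node).
ray.init(address='auto')

print(ray.nodes())               # one entry per node that has registered
print(ray.cluster_resources())   # total CPUs across head + workers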
As soon as I attempt to add complexity, the default setup gets overridden to manual and the workers will not be initialized, despite the terminal saying the Ray cluster has launched. (Running a Python script will also not initiate workers.)
What I want is to launch a cluster, create a conda env, install my dependencies into that conda env, and run a Python script over the whole cluster, where the workers are actually shown as initialized on my AWS EC2 dashboard.
For example, something like this:
cluster_name: ray_cluster

min_workers: 8
max_workers: 8

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-2
    # availability_zone: us-west-2b

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-07c1207a9d40bc3bd  # Default Ubuntu 16.04 AMI.

    # Set primary volume to 50 GiB
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 50

worker_nodes:
    InstanceType: c4.2xlarge
    ImageId: ami-07c1207a9d40bc3bd  # Default Ubuntu 16.04 AMI.

    # Set primary volume to 50 GiB
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 50

# List of shell commands to run to set up nodes.
setup_commands:
    # Consider uncommenting these if you run into dpkg locking issues
    # - sudo pkill -9 apt-get || true
    # - sudo pkill -9 dpkg || true
    # - sudo dpkg --configure -a
    # Install basics.
    - sudo apt-get update
    - sudo apt-get install -y build-essential
    - sudo apt-get install -y curl
    - sudo apt-get install -y unzip
    # Install Node.js in order to build the dashboard.
    - curl -sL https://deb.nodesource.com/setup_12.x | sudo -E bash
    - sudo apt-get install -y nodejs
    # Install Anaconda.
    - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
    - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
    - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
    # Build env
    - git clone pipline
    - conda create --name ray_env
    - conda activate ray_env
    - conda install --name ray_env pip
    - pip install --upgrade pip
    - pip install ray[all]
    - conda env update -n ray_env --file conda_env.yaml
    - conda install xgboost

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - conda activate ray_env

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
    - conda activate ray_env

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
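One thing I am unsure about: my assumption is that setup_commands run in a non-interactive shell, so a plain conda activate may not actually take effect there. If that is the case, I guess the activation lines would need to be something more like the following (a guess on my part, not something I have confirmed):

    # Guess: source conda's shell hook first, since setup commands do not
    # run from an interactive login shell.
    - source ~/anaconda3/etc/profile.d/conda.sh && conda activate ray_env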
The script I'm trying to run is this:
import os
import ray
import time
import sklearn
import xgboost
from xgboost.sklearn import XGBClassifier


def printer():
    print("INSIDE WORKER " + str(time.time()) + " PID : " + str(os.getpid()))


# decorators allow for futures to be created for parallelization
@ray.remote
def func_1():
    model = XGBClassifier()
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count


@ray.remote
def func_2():
    # model = XGBClassifier()
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count


@ray.remote
def func_3():
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count


def main():
    model = XGBClassifier()
    start = time.time()
    results = []
    ray.init(address='auto')

    # append function futures
    for i in range(1000):
        results.append(func_1.remote())
        results.append(func_2.remote())
        results.append(func_3.remote())

    # run in parallel and get aggregated list
    a = ray.get(results)

    # add all values in the list together
    b = 0
    for j in range(len(a)):
        b += a[j]
    print(b)

    # time to complete
    end = time.time()
    print(end - start)


if __name__ == '__main__':
    main()
and I am submitting it with:
ray submit cluster_minimal.yml ray_test.py --start -- --ray-address='xx.31.xx.xx:6379'
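If it helps, the kind of check I would want to see pass is that tasks actually land on the worker machines; something like this is what I have in mind (just a hostname-based sanity check on my side, nothing official):

import ray
import socket

ray.init(address='auto')

@ray.remote
def which_node():
    # Report the hostname of whichever machine the task was scheduled on.
    return socket.gethostname()

# If the workers are really up, more than one hostname should appear here.
print(set(ray.get([which_node.remote() for _ in range(100)])))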
Any help, or any way someone can show me how to do this, would make me eternally grateful. A simple template that runs would be incredibly useful, as nothing I try works. If not, I might have to move to PySpark or something similar, which would be a shame, as the way Ray uses decorators and actors is a very nice way of doing things.