I'm trying to use Ray and Docker to launch jobs programmatically on EC2. I want to use conda in my Docker container for package management. I've figured out how to build the container such that if I run
docker run -i -t my_container:my_tag /bin/bash
I can launch my jobs in the container locally. The problem is that when I add Ray into the picture to launch the jobs remotely, Ray fails with errors like these:
start: ray: command not found
Cluster: my-cluster
Checking AWS environment settings
AWS config
IAM Profile: ray-head-v1
EC2 Key pair (head & workers): [redacted]
VPC Subnets (head & workers): [redacted]
EC2 Security groups (head & workers): [redacted]
EC2 AMI (head & workers): [redacted]
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Acquiring an up-to-date head node
Launched 1 nodes [subnet_id=[redacted]]
Launched instance i-067e250cc8591da86 [state=pending, info=pending]
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/6] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available, retrying in 10 seconds
Not yet available, retrying in 10 seconds
Not yet available, retrying in 10 seconds
Received: 3.21.104.163
SSH still not available SSH command failed., retrying in 5 seconds.
SSH still not available SSH command failed., retrying in 5 seconds.
Success.
Updating cluster configuration. [hash=1e011279ffec6f94b2bff4ebf536e6966be5c79a]
New status: syncing-files
[3/6] Processing file mounts
[4/6] No worker file mounts to sync
New status: setting-up
[3/6] No initialization commands to run.
[4/6] No setup commands to run.
[6/6] Starting the Ray runtime
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
At this point I've reached the limit of what I understand about how Ray and Docker interact. I assume the problem is that head_start_ray_commands gets passed to docker run somehow. Since Docker uses the sh shell to run commands, the bash profile isn't getting sourced properly, so packages like conda and ray aren't working. That explains why there's nothing wrong with the container when I launch a bash shell in interactive mode in a local container instance. I've tried adding /bin/bash --login at the beginning of head_start_ray_commands, but that only seems to cause the whole program to freeze.
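To check the profile-sourcing hypothesis outside of Ray, here is a minimal sketch that can be run on any machine with bash. It does not touch Docker or Ray: `demo_home` and `fake_ray` are hypothetical stand-ins for the container's home directory and the conda-installed `ray` executable, and the `.bash_profile` line stands in for whatever conda adds to the profile in the real image.

```shell
#!/usr/bin/env bash
# Hypothetical demo: a throwaway HOME whose .bash_profile puts a tool directory
# on PATH, standing in for the conda initialization in the real image.
set -eu

demo_home="$(mktemp -d)"
mkdir -p "$demo_home/tools"
printf '#!/bin/sh\necho fake-ray\n' > "$demo_home/tools/fake_ray"
chmod +x "$demo_home/tools/fake_ray"
printf 'export PATH="$HOME/tools:$PATH"\n' > "$demo_home/.bash_profile"

# Non-login, non-interactive shell: .bash_profile is not read, so the tool is not on PATH.
plain="$(HOME="$demo_home" bash -c 'command -v fake_ray || echo not-found')"

# Login shell with -c: the profile is read first, then the command runs and the shell exits.
login="$(HOME="$demo_home" bash --login -c 'command -v fake_ray || echo not-found')"

echo "plain: $plain"
echo "login: $login"

rm -rf "$demo_home"
```

If the same thing is happening in the container, the form to try would be `bash --login -c '<command>'` rather than a bare `/bin/bash --login`. Without `-c`, bash starts a shell that reads commands from standard input, which would be consistent with the freeze described above. Whether Ray's command runner passes such a wrapped command through unchanged is an assumption I have not verified.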
What is the right way to get Ray to source the bash profile before executing commands? If that isn't possible, is there a better way to do this? For reference, here's my current ray config:
init:
  address: null
remote: {}
cluster:
  cluster_name: my-cluster
  min_workers: 0
  max_workers: 2
  initial_workers: 0
  autoscaling_mode: default
  target_utilization_fraction: 0.8
  idle_timeout_minutes: 5
  docker:
    image: [redacted]
    container_name: 'my-container'
    pull_before_run: true
    run_options: ["--gpus 'all'"]
  provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2b
    cache_stopped_nodes: false
    key_pair:
      key_name: [redacted]
  auth:
    ssh_user: ubuntu
  head_node:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  worker_nodes:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  file_mounts: {}
  initialization_commands: []
  setup_commands: []
  head_setup_commands: []
  worker_setup_commands: []
  head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
  worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
Edit
The simplest fix seems to be just avoiding conda altogether in favor of venv.
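A rough sketch of why venv sidesteps the problem: a venv's executables live in a plain bin directory, so putting that directory on PATH is enough, and no shell profile has to be sourced. The temporary `venv_dir` below is a hypothetical location chosen so the sketch can run anywhere with python3; in an image it would presumably be a fixed path added to PATH at build time (for example with a Dockerfile ENV line), which is an assumption about the setup rather than something shown above.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical location; --without-pip keeps the demo fast and offline.
venv_dir="$(mktemp -d)/venv"
python3 -m venv --without-pip "$venv_dir"

# A plain sh with no profile sourced still finds the venv's interpreter,
# because the lookup depends only on PATH.
resolved="$(PATH="$venv_dir/bin:$PATH" sh -c 'command -v python')"
echo "python resolves to: $resolved"

rm -rf "$(dirname "$venv_dir")"
```

Anything installed into the venv, ray included, would be found the same way by the non-login shell that runs the start commands.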