I'm trying to use Ray and Docker to launch jobs programmatically on EC2. I want to use conda in my Docker container for package management. I've figured out how to build the container so that if I run `docker run -i -t my_container:my_tag /bin/bash`, I can launch my jobs in the container locally. The problem is that when I add Ray into the picture to launch the jobs remotely, Ray fails with errors like these:

start: ray: command not found
Cluster: my-cluster

Checking AWS environment settings
AWS config
  IAM Profile: ray-head-v1
  EC2 Key pair (head & workers): [redacted]
  VPC Subnets (head & workers): [redacted]
  EC2 Security groups (head & workers): [redacted]
  EC2 AMI (head & workers): [redacted]

No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Acquiring an up-to-date head node
  Launched 1 nodes [subnet_id=[redacted]]
    Launched instance i-067e250cc8591da86 [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/6] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Received: 3.21.104.163
    SSH still not available SSH command failed., retrying in 5 seconds.
    SSH still not available SSH command failed., retrying in 5 seconds.
    Success.
  Updating cluster configuration. [hash=1e011279ffec6f94b2bff4ebf536e6966be5c79a]
  New status: syncing-files
  [3/6] Processing file mounts
  [4/6] No worker file mounts to sync
  New status: setting-up
  [3/6] No initialization commands to run.
  [4/6] No setup commands to run.
  [6/6] Starting the Ray runtime
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.

At this point I've reached the limit of what I understand about how Ray and Docker interact. I assume the problem is that `head_start_ray_commands` gets passed to `docker run` somehow. Since Docker presumably uses the `sh` shell to run commands, the bash profile isn't sourced, so commands like `conda` and `ray` aren't found. That would explain why nothing is wrong with the container when I launch an interactive bash shell in a local container instance. I've tried adding `/bin/bash --login` at the beginning of `head_start_ray_commands`, but that only seems to make the whole program freeze.
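To make the hypothesis above concrete, here is a minimal sketch that reproduces the symptom without Ray or Docker. Everything in it is a hypothetical stand-in: a throwaway `HOME`, a fake executable named `fake_ray_cli`, and a `~/.bash_profile` that plays the role of the conda setup lines. It only shows standard bash startup behavior (a login shell reads the profile, a plain `bash -c` does not); it does not claim to show how Ray actually invokes commands.

```shell
#!/usr/bin/env bash
# Hypothetical demo: a throwaway HOME whose ~/.bash_profile puts a fake CLI
# on PATH, standing in for the conda initialization in the real container.
demo_home="$(mktemp -d)"
mkdir -p "$demo_home/fakeenv/bin"
printf '#!/bin/sh\necho "fake ray $*"\n' > "$demo_home/fakeenv/bin/fake_ray_cli"
chmod +x "$demo_home/fakeenv/bin/fake_ray_cli"
printf 'export PATH="$HOME/fakeenv/bin:$PATH"\n' > "$demo_home/.bash_profile"

# Non-login, non-interactive shell: the profile is not read, so the CLI is missing.
plain="$(HOME="$demo_home" bash -c 'command -v fake_ray_cli || echo "not found"')"

# Login shell with an explicit -c command: the profile is read before the command runs.
login="$(HOME="$demo_home" bash --login -c 'command -v fake_ray_cli || echo "not found"')"

echo "plain: $plain"
echo "login: $login"
rm -rf "$demo_home"
```

Note that `bash --login` is given a command via `-c` here. A bare `bash --login` with no command reads from standard input, which is possibly why prepending it to the start commands appeared to hang.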

What is the right way to get Ray to source the bash profile before executing commands? If that isn't possible, is there a better way to do this? For reference, here's my current Ray config:

init:
  address: null
remote: {}
cluster:
  cluster_name: my-cluster
  min_workers: 0
  max_workers: 2
  initial_workers: 0
  autoscaling_mode: default
  target_utilization_fraction: 0.8
  idle_timeout_minutes: 5
  docker:
    image: [redacted]
    container_name: 'my-container'
    pull_before_run: true
    run_options: ["--gpus 'all'"]
  provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2b
    cache_stopped_nodes: false
    key_pair:
      key_name: [redacted]
  auth:
    ssh_user: ubuntu
  head_node:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  worker_nodes:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  file_mounts: {}
  initialization_commands: []
  setup_commands: []
  head_setup_commands: []
  worker_setup_commands: []
  head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
  worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Edit

The simplest fix seems to be just avoiding conda altogether in favor of venv.

Riddler
  • I think [this question](https://stackoverflow.com/questions/58793062/activate-a-conda-environment-during-ray-setup) is probably related. – Riddler Dec 29 '20 at 01:25
  • Is Ray installed on the image? Perhaps you may want to include `pip install ray` in the `setup_commands`? – richliaw Dec 29 '20 at 01:26
  • Ray is installed, but in a conda virtual environment. The question is whether it is possible to communicate to docker/ray that the commands need to be run in that environment. Essentially my question reduces to whether that is possible, or if the only solution is to rebuild the container without using conda for package management. – Riddler Dec 29 '20 at 03:14
  • Maybe try replacing `ray start` with `source activate conda_env && ray start` for the `start_ray_commands` – richliaw Dec 29 '20 at 23:29
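The last suggestion depends on the activation and `ray start` sharing one shell invocation. A rough sketch of why that matters, using hypothetical stand-ins only (a fake `fake_ray_cli` binary and an `activate_demo.sh` script that just prepends to `PATH`, not a real conda environment):

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a conda env: a directory with a fake CLI and an
# activation script that only prepends that directory to PATH.
env_dir="$(mktemp -d)"
mkdir -p "$env_dir/bin"
printf '#!/bin/sh\necho "fake ray $*"\n' > "$env_dir/bin/fake_ray_cli"
chmod +x "$env_dir/bin/fake_ray_cli"
printf 'export PATH="%s/bin:$PATH"\n' "$env_dir" > "$env_dir/activate_demo.sh"

# Activation and command chained in one shell: the PATH change is visible.
chained="$(bash -c "source '$env_dir/activate_demo.sh' && fake_ray_cli start --head")"

# Activation in one shell, command in another: the PATH change is lost.
bash -c "source '$env_dir/activate_demo.sh'"
separate="$(bash -c 'fake_ray_cli start --head' 2>&1 || true)"

echo "chained:  $chained"
echo "separate: $separate"
rm -rf "$env_dir"
```

If each entry in `head_start_ray_commands` runs in its own shell, activating in one entry would not carry over to the next, so the activation would need to be chained with `&&` inside every entry that calls `ray`. Whether `source activate` itself works depends on how conda is set up in the image, which this sketch does not cover.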

0 Answers