Questions tagged [ray]

Ray is a library for writing parallel and distributed Python applications. It scales from your laptop to a large cluster, has a simple yet flexible API, and provides high performance out of the box. At its core, its API provides a simple way to take arbitrary Python functions and classes and execute them in a distributed setting.
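For readers new to the tag, here is a minimal sketch of that core API (assuming a local `ray` installation; the function is illustrative):

```python
import ray

ray.init()  # start Ray locally; on a cluster, ray.init(address="auto")

@ray.remote
def square(x):
    # an ordinary Python function, now runnable as a distributed task
    return x * x

# .remote() returns object refs immediately; ray.get() blocks for results
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```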


Ray also includes a number of powerful libraries:

  • Cluster Autoscaling: Automatically configure, launch, and manage clusters and experiments on AWS or GCP.
  • Hyperparameter Tuning: Automatically run experiments, tune hyperparameters, and visualize results with Ray Tune.
  • Reinforcement Learning: RLlib is a state-of-the-art platform for reinforcement learning research as well as reinforcement learning in practice.
  • Distributed Pandas: Modin provides a faster dataframe library with the same API as Pandas.
702 questions
4 votes · 2 answers

Ray Cluster: How to Access All Node Resources

I have access to a cluster of nodes, and my understanding was that once I started Ray on each node with the same Redis address, the head node would have access to all of the resources of all of the nodes. Main script: export LC_ALL=en_US.utf-8 export…
asked by Lubed Up Slug (168)
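A hedged sketch of the pattern the asker seems to be after (the port and address are placeholders; very old Ray versions used --redis-address instead of --address):

```python
# On the head node:    ray start --head --port=6379
# On each worker node: ray start --address=<head-node-ip>:6379
import ray

# Connect this driver to the already-running cluster
ray.init(address="auto")

# Should report the aggregated CPUs/GPUs/memory of every joined node
print(ray.cluster_resources())
print(ray.nodes())  # per-node detail, useful to confirm workers registered
```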
4 votes · 1 answer

Early stopping ray.tune experiments when complex conditions are met?

Is there a way of stopping ray.tune experiments (for example when using PBT) when they are clearly overfitting or when a metric has not improved for a long time?
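Tune does support this through a custom Stopper passed to tune.run(stop=...); below is a sketch, with the metric names and thresholds made up for illustration:

```python
from ray.tune import Stopper

class PlateauOrOverfitStopper(Stopper):
    """Stop a trial when it looks overfit or has stopped improving."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = {}   # trial_id -> best val_loss seen so far
        self.stale = {}  # trial_id -> iterations since last improvement

    def __call__(self, trial_id, result):
        val = result["val_loss"]  # hypothetical reported metric
        if val < self.best.get(trial_id, float("inf")):
            self.best[trial_id] = val
            self.stale[trial_id] = 0
        else:
            self.stale[trial_id] = self.stale.get(trial_id, 0) + 1
        overfit = val > 1.5 * result["train_loss"]  # made-up condition
        return overfit or self.stale[trial_id] >= self.patience

    def stop_all(self):
        # returning True here would end the whole experiment
        return False

# usage: tune.run(trainable, stop=PlateauOrOverfitStopper(patience=50))
```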
4 votes · 1 answer

How to set up Ray project autoscaling on GCP

I am having real difficulty setting up Ray autoscaling on Google Cloud Compute. I can get it to work on AWS no problem, but I keep running into the following error when running ray up: googleapiclient.errors.HttpError:…
4 votes · 2 answers

Ray: How to run many actors on one GPU?

I have only one GPU, and I want to run many actors on that GPU. Here's what I do using Ray, following https://ray.readthedocs.io/en/latest/actors.html. First define the network on the GPU: class Network(): def __init__(self, ***some args here***): …
asked by Maybe (2,129)
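The usual approach is fractional num_gpus so several actors share one device; a sketch (the four-way split is illustrative, and fitting all models in GPU memory is the user's responsibility):

```python
import os
import ray

ray.init(num_gpus=1)

# Ray only does the resource bookkeeping: four 0.25-GPU actors are
# allowed to coexist on the single physical GPU.
@ray.remote(num_gpus=0.25)
class Network:
    def gpu_ids(self):
        # Ray sets CUDA_VISIBLE_DEVICES in each actor to its assigned GPU
        return os.environ.get("CUDA_VISIBLE_DEVICES")

actors = [Network.remote() for _ in range(4)]
print(ray.get([a.gpu_ids.remote() for a in actors]))  # all report GPU 0
```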
4 votes · 1 answer

How to use GPUs with Ray in PyTorch? Should I specify num_gpus for the remote class?

When I use Ray with PyTorch, I do not set any num_gpus flag for the remote class, and I get the following error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. The main process is: I…
asked by Han Zheng (309)
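That error is consistent with the actor process not being granted a GPU: without num_gpus, Ray leaves CUDA_VISIBLE_DEVICES empty in the worker, so torch sees no device. A sketch of the fix (the model is a placeholder):

```python
import ray
import torch

ray.init(num_gpus=1)

@ray.remote(num_gpus=1)  # grants the actor a GPU and sets CUDA_VISIBLE_DEVICES
class Trainer:
    def __init__(self):
        assert torch.cuda.is_available()  # now True inside the actor
        self.model = torch.nn.Linear(4, 2).cuda()

    def step(self, x):
        x = torch.as_tensor(x, dtype=torch.float32).cuda()
        return self.model(x).sum().item()

t = Trainer.remote()
print(ray.get(t.step.remote([[1.0, 2.0, 3.0, 4.0]])))
```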
3 votes · 2 answers

Distributing a Python function across multiple worker nodes

I'm trying to understand what would be a good framework that integrates easily with existing python code and allows distributing a huge dataset across multiple worker nodes to perform some transformation or operation on it. The expectation is that…
asked by Tushar (528)
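For the plain "shard a dataset, map a function over the shards" case, Ray's task API is often enough; a sketch (the transform is illustrative):

```python
import numpy as np
import ray

ray.init()  # on a cluster: ray.init(address="auto")

@ray.remote
def transform(shard):
    # any Python/NumPy work; runs on whichever node has free capacity
    return shard * 2

data = np.arange(1_000_000)
shards = np.array_split(data, 16)

# each shard is serialized to a worker, processed, and gathered back
results = ray.get([transform.remote(s) for s in shards])
out = np.concatenate(results)
```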
3 votes · 0 answers

Ray doesn't work in a Docker container (Linux)

I have Python code that uses Ray. It works locally on my Mac, but once I try to run it inside a local Docker container I get the following warning: WARNING services.py:1922 -- WARNING: The object store is using /tmp instead of /dev/shm because…
asked by HagaiA (193)
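That warning usually means the container's /dev/shm is smaller than the object store Ray wants; the common fixes are a bigger shm or a smaller object store. A sketch (sizes are illustrative):

```python
# Preferred fix is on the Docker side:
#   docker run --shm-size=2g my-image
# Alternatively, cap the object store so it fits in the available
# shared memory instead of falling back to /tmp:
import ray

ray.init(object_store_memory=1 * 1024**3)  # 1 GiB, illustrative
```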
3 votes · 1 answer

Ray has weird time consumption

I have a tiny Ray pipeline like this: import ray import numpy as np import time @ray.remote class PersonDetector: def __init__(self) -> None: self.model = self._init_model() def _init_model(self): s =…
asked by Nicholas Jela (2,540)
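One thing worth ruling out in measurements like this: the first remote call pays for actor startup and model loading, so timing should exclude a warm-up call. A sketch (the sleep stands in for model initialization):

```python
import time
import ray

ray.init()

@ray.remote
class PersonDetector:
    def __init__(self):
        time.sleep(2)  # stand-in for slow model loading

    def detect(self, frame):
        return frame

d = PersonDetector.remote()
ray.get(d.detect.remote(0))  # warm-up: includes actor construction time

start = time.perf_counter()
ray.get([d.detect.remote(i) for i in range(100)])
print(time.perf_counter() - start)  # steady-state cost only
```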
3 votes · 1 answer

Pool in a Ray cluster sends the same number of jobs to different nodes even though the nodes have different sizes/different numbers of CPUs

I am using Pool in a Ray cluster. I want to be able to scale the number of jobs sent to different nodes proportionately to the compute capability (e.g., the number of CPUs) that each node has. Unfortunately, the Ray cluster pool I set up is sending…
asked by emmanuelsa (657)
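One hedged workaround is to bypass Pool's fixed worker placement and submit plain one-CPU tasks, letting the scheduler pack each node up to its own CPU count:

```python
import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def job(i):
    # each task reserves one CPU, so a 32-CPU node runs roughly twice as
    # many concurrent jobs as a 16-CPU node
    return i * i

results = ray.get([job.remote(i) for i in range(1000)])
```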
3 votes · 1 answer

Ray: creating a singleton Actor

I'm trying to find an elegant way to make sure a Ray actor gets instantiated only once (like a singleton), so that if someone calls Singleton.remote() the already-launched actor is returned. Is that possible? The common singleton decorator…
asked by Unziello (103)
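Named actors give roughly this behavior; a sketch using get_if_exists (available in recent Ray releases; the actor name is arbitrary):

```python
import ray

ray.init()

@ray.remote
class Singleton:
    def __init__(self):
        self.value = 0

    def bump(self):
        self.value += 1
        return self.value

# get_if_exists returns the already-running actor instead of raising
# when the name is taken
a = Singleton.options(name="singleton", get_if_exists=True).remote()
b = Singleton.options(name="singleton", get_if_exists=True).remote()

ray.get(a.bump.remote())
print(ray.get(b.bump.remote()))  # 2: both handles point at the same actor
```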
3 votes · 0 answers

How to really do action masking in Ray (RLlib)?

1) It's unclear how to make action masking in RLlib any more complex than what the examples show. The mask works well in the example action_mask_model.py with class TorchActionMaskModel(TorchModelV2, nn.Module), self.observation_space = Dict({ …
asked by sirjay (1,767)
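The heart of that example model is adding log-mask penalties to the logits; a condensed, RLlib-free sketch of the idea (spaces and sizes are illustrative):

```python
import torch
import torch.nn as nn

FLOAT_MIN = torch.finfo(torch.float32).min  # large negative logit for masked actions

class MaskedPolicy(nn.Module):
    """Toy stand-in for TorchActionMaskModel: the observation carries an
    action_mask (1 = allowed, 0 = forbidden) alongside the real features."""

    def __init__(self, obs_size=8, num_actions=4):
        super().__init__()
        self.body = nn.Linear(obs_size, num_actions)

    def forward(self, obs, action_mask):
        logits = self.body(obs)
        # log(1) = 0 leaves allowed actions untouched; log(0) = -inf is
        # clamped so the softmax stays numerically finite
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
        return logits + inf_mask

m = MaskedPolicy()
obs = torch.randn(1, 8)
mask = torch.tensor([[1.0, 0.0, 1.0, 1.0]])
print(m(obs, mask))  # the forbidden action's logit is pushed to FLOAT_MIN
```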
3 votes · 1 answer

Ray training with PyTorch and PyTorch Lightning raises ValueError("Expected a parent")

I have a code that has a data module and a model and I am training my model with Ray trainer, here is my code: class CSIDataset(pl.LightningDataModule): def __init__(self, pkl_dir): super().__init__() self.samples…
asked by Samira Khorshidi (963)
3 votes · 1 answer

Limiting CPU resources of Ray

I'm trying to manage the resources of a remote machine that we use for a daily task (that uses Ray). Is it possible to limit the number of CPUs (or equivalently the number of workers) that Ray uses? The remote machine has 16 cores. Can I limit Ray…
asked by M.Erkin (120)
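Yes, on both entry points; a sketch (8 of the 16 cores is illustrative):

```python
import ray

# Advertise only 8 of the machine's 16 cores to Ray
ray.init(num_cpus=8)

# Or, when starting a long-lived node from the shell:
#   ray start --head --num-cpus=8
```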
3 votes · 1 answer

How do I checkpoint only the best model from a ray tune run?

NOTE: To some extent, this was already asked here but my question tackles a different aspect of getting the best checkpoint. In the referenced question, the author only desired to retrieve the best checkpoint from a set of checkpoints after the ray…
asked by c0mr4t (311)
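With the older tune.run API this is typically done with keep_checkpoints_num plus a score attribute; a sketch (the metric and trainable are illustrative, and newer Ray releases moved this to CheckpointConfig):

```python
import os
from ray import tune

def trainable(config):
    for step in range(10):
        loss = 1.0 / (step + 1)
        # old-style function-trainable checkpointing
        with tune.checkpoint_dir(step=step) as d:
            with open(os.path.join(d, "ckpt.txt"), "w") as f:
                f.write(str(loss))
        tune.report(val_loss=loss)

analysis = tune.run(
    trainable,
    metric="val_loss",
    mode="min",
    keep_checkpoints_num=1,                # prune all but the best checkpoint
    checkpoint_score_attr="min-val_loss",  # "min-" means lower is better
)
print(analysis.best_checkpoint)  # checkpoint of the best trial
```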
3 votes · 0 answers

Unable to pass multiple-value parameters in config to my model when using Ray with PyTorch

I am new to PyTorch and Ray. I was trying to tune my Lightning model's hyperparameters using Ray, but when I passed multiple-value parameters in the config dictionary, I got an error like this: TypeError: empty() received an invalid combination of…
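That TypeError typically appears when a raw list from the config reaches a torch layer constructor; wrapping the alternatives in a search-space object makes Tune sample a single value per trial. A sketch (names are illustrative):

```python
from ray import tune

config = {
    # Wrong: "hidden_size": [64, 128, 256] would pass the whole list to
    # the model, and torch.nn.Linear(in_dim, [64, 128, 256]) raises the
    # empty() TypeError. tune.choice samples one int per trial instead.
    "hidden_size": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
}

def trainable(cfg):
    hidden = cfg["hidden_size"]  # a single int per trial, e.g. 128
    tune.report(val_loss=1.0 / hidden)

tune.run(trainable, config=config, num_samples=4)
```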