Questions tagged [distributed]

Multiple computers working together, using a network to communicate

A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.

2221 questions
0 votes, 0 answers

How to set up MASTER_PORT and MASTER_ADDR in slurm

In torch's official documentation on DDP, it says to set it up as follows: def setup(rank, world_size): os.environ['MASTER_ADDR'] = 'localhost' os.environ['MASTER_PORT'] = '12355', and now I am using slurm to submit sbatch…
rene smith • 83 • 1 • 9
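For context, a minimal sketch of the usual SLURM approach: instead of hard-coding 'localhost', derive MASTER_ADDR from the first host in SLURM_NODELIST and take the rank and world size from the variables srun exports. The port value and the one-task-per-GPU launch are assumptions.

    # Sketch: deriving MASTER_ADDR/MASTER_PORT from SLURM's environment instead of
    # hard-coding 'localhost'. Assumes the job is launched with srun, one task per GPU.
    import os
    import subprocess
    import torch.distributed as dist

    def setup_from_slurm(port="12355"):
        # The first hostname in the allocation acts as the rendezvous host.
        nodelist = os.environ["SLURM_NODELIST"]
        master = subprocess.check_output(
            ["scontrol", "show", "hostnames", nodelist], text=True
        ).splitlines()[0]
        os.environ["MASTER_ADDR"] = master
        os.environ["MASTER_PORT"] = port           # any free port; 12355 is just an example
        rank = int(os.environ["SLURM_PROCID"])      # global rank assigned by srun
        world_size = int(os.environ["SLURM_NTASKS"])
        dist.init_process_group("nccl", rank=rank, world_size=world_size)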
0 votes, 1 answer

How to spin up/down workers programmatically at run-time on Kubernetes based on new Redis queues and their load?

Suppose I want to implement this architecture deployed on Kubernetes cluster: Gateway Simple RESTful HTTP microservice accepting scraping tasks (URLs to scrape along with postback urls) Request Queues - Redis (or other message broker) queues…
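One way to sketch this without a custom operator is a small control loop that reads the Redis queue length and patches the worker Deployment's replica count through the Kubernetes API (KEDA's Redis scaler does essentially this off the shelf). The queue name, Deployment name, and jobs-per-worker ratio below are assumptions.

    # Minimal sketch of a control loop that scales a worker Deployment from Redis queue depth.
    # Names ("scraper-workers", "tasks:pending") and the 10-jobs-per-worker ratio are assumptions.
    import time
    import redis
    from kubernetes import client, config

    def desired_replicas(queue_len, jobs_per_worker=10, max_workers=50):
        return min(max_workers, -(-queue_len // jobs_per_worker))  # ceiling division

    def run():
        config.load_incluster_config()              # or load_kube_config() outside the cluster
        apps = client.AppsV1Api()
        r = redis.Redis(host="redis", port=6379)
        while True:
            backlog = r.llen("tasks:pending")
            apps.patch_namespaced_deployment_scale(
                name="scraper-workers", namespace="default",
                body={"spec": {"replicas": desired_replicas(backlog)}},
            )
            time.sleep(30)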
0 votes, 0 answers

Using torch.distributed modules on AWS instance to parallelise model training by splitting model

I am wondering how to do model parallelism using pytorch's distributed modules. Basically what I want to do is the following - class LargeModel(nn.Module): def __init__(self, in_features, n_hid, out_features) -> None: …
mndl • 17 • 3
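As a reference point, the simplest form of model parallelism on a single multi-GPU instance does not need torch.distributed at all: place different layers on different devices and move the activations between them in forward(). A minimal sketch, where the layer sizes and the two-GPU split are assumptions:

    # Sketch of naive model parallelism: split layers across two GPUs on one instance
    # and move activations between them in forward().
    import torch
    import torch.nn as nn

    class LargeModel(nn.Module):
        def __init__(self, in_features, n_hid, out_features):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(in_features, n_hid), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(n_hid, out_features).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            return self.part2(x.to("cuda:1"))   # activations hop to the second GPU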
0 votes, 1 answer

What do the entries in Lamport clocks representations represent?

I'm trying to understand an illustrative example of how Lamport's algorithm is applied. In the course that I'm taking, we were presented with two representations of the clocks within three [distant] processes, one with the Lamport algorithm applied…
Mehdi Charife • 722 • 1 • 7 • 22
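For reference, each entry in diagrams like the ones the question describes is normally the value of that process's logical clock after applying Lamport's two update rules; a minimal sketch of those rules:

    # Sketch of Lamport's logical clock rules: one counter per process.
    class LamportClock:
        def __init__(self):
            self.time = 0

        def local_event(self):
            self.time += 1                      # rule 1: tick before every local event
            return self.time

        def send(self):
            return self.local_event()           # timestamp attached to the outgoing message

        def receive(self, msg_time):
            self.time = max(self.time, msg_time) + 1   # rule 2: jump past the sender's clock
            return self.time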
0 votes, 0 answers

How to use multiple GPUs for training?

I am simply trying to understand how to format a config file to allow multiple GPUs/distributed training to take place via the "train" command. The only clear tutorial out there is seemingly for much older versions of AllenNLP: Tutorial: How to…
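A hedged sketch, not tied to a specific AllenNLP release: recent versions enable multi-GPU training through a top-level "distributed" block in the training config rather than a separate command. The Python dict below mirrors the JSON/jsonnet structure, and the device list is an assumption.

    # The rest of the config (dataset_reader, model, data_loader, trainer) stays unchanged.
    config = {
        # "dataset_reader": ..., "model": ..., "data_loader": ..., "trainer": ...,
        "distributed": {"cuda_devices": [0, 1]},  # one worker process per listed GPU
    }
    # Launched the usual way, e.g.:  allennlp train my_config.jsonnet -s output_dir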
0 votes, 1 answer

Unable to Run Query with Cascade after Version upgrade from 20.07.2 to 20.07.3

We are new to the Dgraph database. After upgrading the server, we are unable to run the query below with the cascade option `query get_tenantlevel_data($tenantuid:string){ get_tenantlevel_data(func:…
0 votes, 1 answer

Predis - Removing server from the connection pool

Say I have N servers in the Predis connection pool. I found that when one of the servers goes down, Predis does not work (i.e. new Predis\Client(s1, s2, ...) does not return successfully if any server Si is down). First, the entry of that…
Mohit Gupta • 649 • 1 • 7 • 15
0 votes, 0 answers

Parallel Text Preprocessing with Pyro

I am using distributed Gensim for topic modeling of documents. However, the preprocessing part is not distributed. Therefore, I would like to implement distributed preprocessing of the text data with Pyro (since distributed Gensim is also based on…
rolr • 21 • 2
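A minimal sketch of a Pyro4 worker in the same spirit as Gensim's distributed workers: expose a preprocessing class, register it with the Pyro name server, and have a driver fan document chunks out to the registered proxies. The object name ("text.preprocessor") and the tokenisation are placeholders.

    # Pyro4 worker exposing a preprocessing method; requires a running name server (pyro4-ns).
    import Pyro4

    @Pyro4.expose
    class Preprocessor:
        def preprocess(self, docs):
            # stand-in for real cleaning/tokenisation
            return [doc.lower().split() for doc in docs]

    def main():
        daemon = Pyro4.Daemon()                                  # this worker's Pyro daemon
        uri = daemon.register(Preprocessor())
        Pyro4.locateNS().register("text.preprocessor", uri)      # advertise under a name
        daemon.requestLoop()

    if __name__ == "__main__":
        main()

    # A driver script would then send chunks to each worker via
    # Pyro4.Proxy("PYRONAME:text.preprocessor").preprocess(chunk)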
0 votes, 1 answer

What is the key difference between Multi-Paxos and the basic Paxos protocol?

How is Multi-Paxos different from basic Paxos? How does ordering work in Multi-Paxos? Can someone explain Multi-Paxos along with a diagram? I tried going through videos and research papers but cannot understand the exact difference and concept of…
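In short: basic Paxos reaches consensus on a single value and pays the Phase 1 (prepare/promise) round every time, while Multi-Paxos lets a stable leader run Phase 1 once and then drive only Phase 2 (accept) for a whole sequence of log slots; the slot index is what provides the ordering. The sketch below shows only that structural difference, with stubbed-out message rounds (not a working protocol):

    # Structural sketch only. The phase helpers are stubs for the real message rounds.
    def phase1_prepare(acceptors, n):
        return [("promise", n) for _ in acceptors]       # stub: every acceptor promises

    def phase2_accept(acceptors, n, value, slot=0):
        return value                                     # stub: the value is chosen

    def basic_paxos(value, acceptors, n):
        phase1_prepare(acceptors, n)                     # both phases for every single value
        return phase2_accept(acceptors, n, value)

    def multi_paxos(values, acceptors, n):
        phase1_prepare(acceptors, n)                     # once, while leadership is stable
        # Ordering comes from the log slot index: slot i holds the i-th command.
        return {slot: phase2_accept(acceptors, n, v, slot) for slot, v in enumerate(values)}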
0 votes, 0 answers

What is the difference between RPC and collective communication?

As opposed to point-to-point communication, collectives allow for communication patterns across all processes in a group. What are the main differences between RPC and collective communication? How do they differ at the backend?
hsh • 111 • 2 • 8
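A small sketch of the two styles as they appear in torch.distributed, assuming the default process group and the RPC framework are already initialised: a collective such as all_reduce is executed symmetrically by every rank, whereas an RPC targets one named peer and runs a function there. The worker name and the add function are placeholders.

    import torch
    import torch.distributed as dist
    import torch.distributed.rpc as rpc

    def collective_example(rank):
        t = torch.ones(1) * rank
        dist.all_reduce(t)          # every rank participates; every rank ends up with the sum
        return t

    def add(a, b):
        return a + b

    def rpc_example():
        # point-to-point: one caller asks one named peer to run a function and return the result
        return rpc.rpc_sync("worker1", add, args=(torch.ones(1), torch.ones(1)))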
0 votes, 0 answers

How to use ZooKeeper to distribute work across a cluster of servers

I'm studying up for system design interviews and have run into this pattern in several different problems. Imagine I have a large volume of work that needs to be repeatedly processed at some cadence. For example, I have a large number of alert…
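One common sketch of this pattern uses the kazoo client: every server registers an ephemeral znode for membership, and each server claims the slice of work that hashes to its position in the sorted member list, so work rebalances automatically when a node joins or dies. The paths and the modulo partitioning are assumptions.

    import socket
    import zlib
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    me = socket.gethostname()
    # The ephemeral node disappears automatically if this server dies or disconnects.
    zk.create(f"/alert-workers/{me}", ephemeral=True, makepath=True)

    def my_share(alert_ids):
        members = sorted(zk.get_children("/alert-workers"))
        my_index = members.index(me)
        # Deterministic partitioning: alert a goes to member (crc32(a) mod cluster size).
        return [a for a in alert_ids
                if zlib.crc32(str(a).encode()) % len(members) == my_index]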
0 votes, 1 answer

torch.distributed.barrier() added on all processes not working

import torch import os torch.distributed.init_process_group(backend="nccl") local_rank = int(os.environ["LOCAL_RANK"]) if local_rank >0: torch.distributed.barrier() print(f"Entered process {local_rank}") if local_rank ==0: …
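For comparison, the usual "rank 0 does the one-off work first" pattern: barrier() is a collective, so every rank must reach a matching call; in the snippet above only the non-zero ranks ever call it, which is one common way such a script hangs. A sketch, assuming a torchrun launch and the NCCL backend:

    import os
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])

    if local_rank != 0:
        dist.barrier()                         # non-zero ranks wait here ...
    if local_rank == 0:
        print("rank 0 doing one-off setup")    # placeholder for rank-0-only work
        dist.barrier()                         # ... until rank 0 reaches its matching barrier
    print(f"Entered process {local_rank}")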
0 votes, 0 answers

Using Dart for desktop, distributed and cloud programming

I would like to use Dart for desktop, distributed and cloud programming. Will Google extend the Flutter framework to include libraries for desktop GUI, distributed, and cloud programming, in the same way as Java? To use Dart as a general-purpose…
0 votes, 1 answer

How to do a groupby of a Ray dataset using two keys?

Let's say I want to group by A and B and calculate the sum of Sales. How should I go about it? import pandas as pd import ray ray.init() rdf = ray.data.from_pandas(pd.DataFrame({'A':[1,2,3],'B':[1,1,4],'Sales':[20,30,40]})) I did try doing…
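One workaround sketch, assuming a Ray Data version whose groupby takes a single column name (newer releases may accept a list of keys directly): build a composite key column with map_batches, group on it, then sum Sales.

    import pandas as pd
    import ray

    ray.init()
    rdf = ray.data.from_pandas(
        pd.DataFrame({"A": [1, 2, 3], "B": [1, 1, 4], "Sales": [20, 30, 40]})
    )

    def add_key(batch: pd.DataFrame) -> pd.DataFrame:
        # Composite key standing in for a true multi-column groupby.
        batch["AB"] = batch["A"].astype(str) + "_" + batch["B"].astype(str)
        return batch

    result = (
        rdf.map_batches(add_key, batch_format="pandas")
           .groupby("AB")
           .sum("Sales")
    )
    print(result.to_pandas())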
0 votes, 1 answer

Does optuna.integration.TorchDistributedTrial support multinode optimization?

Does integration.TorchDistributedTrial support multinode optimization? I'm using Optuna on a SLURM cluster. Suppose I would like to do a distributed hyperparameter optimization using two nodes with two GPUs each. Would submitting a script like…
Siem • 72 • 9
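For reference, a sketch of the pattern from Optuna's distributed examples: rank 0 owns the study, and every other rank replays each trial and receives the suggested values through torch.distributed broadcasts, so it follows whatever process group the SLURM launch initialises. The metric and the trial count are placeholders.

    import optuna
    import torch.distributed as dist

    N_TRIALS = 20

    def objective(single_trial):
        # On rank 0 this wraps the real trial; on other ranks single_trial is None.
        trial = optuna.integration.TorchDistributedTrial(single_trial)
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        # ... build the DDP model, train, evaluate ...
        return 1.0 - lr                       # placeholder metric for the sketch

    def main():
        dist.init_process_group(backend="nccl")
        if dist.get_rank() == 0:
            study = optuna.create_study(direction="maximize")
            study.optimize(objective, n_trials=N_TRIALS)
        else:
            for _ in range(N_TRIALS):
                try:
                    objective(None)           # follow along with rank 0's trials
                except optuna.TrialPruned:
                    pass

    if __name__ == "__main__":
        main()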