Questions tagged [distributed]

Multiple computers working together, using a network to communicate

A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.

2221 questions
0 votes, 0 answers

How to set up MASTER_PORT and MASTER_ADDR in slurm

In torch's official documentation on DDP, it says to set it up as follows: def setup(rank, world_size): os.environ['MASTER_ADDR'] = 'localhost' os.environ['MASTER_PORT'] = '12355', and now I am using slurm to submit sbatch…
rene smith • 83 • 1 • 9
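For context, a minimal sketch of the usual SLURM approach: instead of hard-coding 'localhost', derive MASTER_ADDR from the first host in SLURM_NODELIST and take the rank and world size from the variables srun exports. The port value and the one-task-per-GPU launch are assumptions.

    # Sketch: deriving MASTER_ADDR/MASTER_PORT from SLURM's environment instead of
    # hard-coding 'localhost'. Assumes the job is launched with srun, one task per GPU.
    import os
    import subprocess
    import torch.distributed as dist

    def setup_from_slurm(port="12355"):
        # The first hostname in the allocation acts as the rendezvous host.
        nodelist = os.environ["SLURM_NODELIST"]
        master = subprocess.check_output(
            ["scontrol", "show", "hostnames", nodelist], text=True
        ).splitlines()[0]
        os.environ["MASTER_ADDR"] = master
        os.environ["MASTER_PORT"] = port           # any free port; 12355 is just an example
        rank = int(os.environ["SLURM_PROCID"])      # global rank assigned by srun
        world_size = int(os.environ["SLURM_NTASKS"])
        dist.init_process_group("nccl", rank=rank, world_size=world_size)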
0 votes, 1 answer

How to spin up/down workers programmatically at run-time on Kubernetes based on new Redis queues and their load?

Suppose I want to implement this architecture deployed on Kubernetes cluster: Gateway Simple RESTful HTTP microservice accepting scraping tasks (URLs to scrape along with postback urls) Request Queues - Redis (or other message broker) queues…
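One way to sketch this without a custom operator is a small control loop that reads the Redis queue length and patches the worker Deployment's replica count through the Kubernetes API (KEDA's Redis scaler does essentially this off the shelf). The queue name, Deployment name, and jobs-per-worker ratio below are assumptions.

    # Minimal sketch of a control loop that scales a worker Deployment from Redis queue depth.
    # Names ("scraper-workers", "tasks:pending") and the 10-jobs-per-worker ratio are assumptions.
    import time
    import redis
    from kubernetes import client, config

    def desired_replicas(queue_len, jobs_per_worker=10, max_workers=50):
        return min(max_workers, -(-queue_len // jobs_per_worker))  # ceiling division

    def run():
        config.load_incluster_config()              # or load_kube_config() outside the cluster
        apps = client.AppsV1Api()
        r = redis.Redis(host="redis", port=6379)
        while True:
            backlog = r.llen("tasks:pending")
            apps.patch_namespaced_deployment_scale(
                name="scraper-workers", namespace="default",
                body={"spec": {"replicas": desired_replicas(backlog)}},
            )
            time.sleep(30)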
0 votes, 0 answers

Using torch.distributed modules on AWS instance to parallelise model training by splitting model

I am wondering how to do model parallelism using pytorch's distributed modules. Basically what I want to do is the following - class LargeModel(nn.Module): def __init__(self, in_features, n_hid, out_features) -> None: …
mndl • 17 • 3
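As a reference point, the simplest form of model parallelism on a single multi-GPU instance does not need torch.distributed at all: place different layers on different devices and move the activations between them in forward(). A minimal sketch, where the layer sizes and the two-GPU split are assumptions:

    # Sketch of naive model parallelism: split layers across two GPUs on one instance
    # and move activations between them in forward().
    import torch
    import torch.nn as nn

    class LargeModel(nn.Module):
        def __init__(self, in_features, n_hid, out_features):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(in_features, n_hid), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(n_hid, out_features).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            return self.part2(x.to("cuda:1"))   # activations hop to the second GPU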
0 votes, 1 answer

What do the entries in Lamport clocks representations represent?

I'm trying to understand an illustrative example of how Lamport's algorithm is applied. In the course that I'm taking, we were presented with two representations of the clocks within three [distant] processes, one with the Lamport algorithm applied…
Mehdi Charife • 722 • 1 • 7 • 22
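For reference, each entry in diagrams like the ones the question describes is normally the value of that process's logical clock after applying Lamport's two update rules; a minimal sketch of those rules:

    # Sketch of Lamport's logical clock rules: one counter per process.
    class LamportClock:
        def __init__(self):
            self.time = 0

        def local_event(self):
            self.time += 1                      # rule 1: tick before every local event
            return self.time

        def send(self):
            return self.local_event()           # timestamp attached to the outgoing message

        def receive(self, msg_time):
            self.time = max(self.time, msg_time) + 1   # rule 2: jump past the sender's clock
            return self.time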
0 votes, 0 answers

How to use multiple GPUs for training?

I am simply trying to understand how to format a config file to allow multiple GPUs/distributed training to take place via the "train" command. The only clear tutorial out there is seemingly for much older versions of AllenNLP: Tutorial: How to…
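A hedged sketch, not tied to a specific AllenNLP release: recent versions enable multi-GPU training through a top-level "distributed" block in the training config rather than a separate command. The Python dict below mirrors the JSON/jsonnet structure, and the device list is an assumption.

    # The rest of the config (dataset_reader, model, data_loader, trainer) stays unchanged.
    config = {
        # "dataset_reader": ..., "model": ..., "data_loader": ..., "trainer": ...,
        "distributed": {"cuda_devices": [0, 1]},  # one worker process per listed GPU
    }
    # Launched the usual way, e.g.:  allennlp train my_config.jsonnet -s output_dir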
0 votes, 1 answer

Unable to Run Query with Cascade after Version upgrade from 20.07.2 to 20.07.3

We are new to the Dgraph database. After upgrading the server, we are unable to run the query below with the cascade option `query get_tenantlevel_data($tenantuid:string){ get_tenantlevel_data(func:…
0 votes, 1 answer

Predis - Removing server from the connection pool

Say I have N servers in the Predis connection pool. I found that when one of the servers goes down, Predis does not work (i.e. new Predis\Client(s1, s2, ...) does not return successfully if any server Si is down). First, the entry of that…
Mohit Gupta • 649 • 1 • 7 • 15
0 votes, 0 answers

Parallel Text Preprocessing with Pyro

I am using distributed Gensim for topic modeling of documents. However, the preprocessing part is not distributed. Therefore, I would like to implement distributed preprocessing of the text data with Pyro (since distributed Gensim is also based on…
rolr • 21 • 2
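A minimal sketch of a Pyro4 worker in the same spirit as Gensim's distributed workers: expose a preprocessing class, register it with the Pyro name server, and have a driver fan document chunks out to the registered proxies. The object name ("text.preprocessor") and the tokenisation are placeholders.

    # Pyro4 worker exposing a preprocessing method; requires a running name server (pyro4-ns).
    import Pyro4

    @Pyro4.expose
    class Preprocessor:
        def preprocess(self, docs):
            # stand-in for real cleaning/tokenisation
            return [doc.lower().split() for doc in docs]

    def main():
        daemon = Pyro4.Daemon()                                  # this worker's Pyro daemon
        uri = daemon.register(Preprocessor())
        Pyro4.locateNS().register("text.preprocessor", uri)      # advertise under a name
        daemon.requestLoop()

    if __name__ == "__main__":
        main()

    # A driver script would then send chunks to each worker via
    # Pyro4.Proxy("PYRONAME:text.preprocessor").preprocess(chunk)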
0 votes, 1 answer

What is the key difference between Multi-Paxos and the basic Paxos protocol?

How is Multi-Paxos different from basic Paxos? How does ordering work in Multi-Paxos? Can someone explain Multi-Paxos along with a diagram? I tried going through videos and research papers but cannot understand the exact difference and concept of…
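In short: basic Paxos reaches consensus on a single value and pays the Phase 1 (prepare/promise) round every time, while Multi-Paxos lets a stable leader run Phase 1 once and then drive only Phase 2 (accept) for a whole sequence of log slots; the slot index is what provides the ordering. The sketch below shows only that structural difference, with stubbed-out message rounds (not a working protocol):

    # Structural sketch only. The phase helpers are stubs for the real message rounds.
    def phase1_prepare(acceptors, n):
        return [("promise", n) for _ in acceptors]       # stub: every acceptor promises

    def phase2_accept(acceptors, n, value, slot=0):
        return value                                     # stub: the value is chosen

    def basic_paxos(value, acceptors, n):
        phase1_prepare(acceptors, n)                     # both phases for every single value
        return phase2_accept(acceptors, n, value)

    def multi_paxos(values, acceptors, n):
        phase1_prepare(acceptors, n)                     # once, while leadership is stable
        # Ordering comes from the log slot index: slot i holds the i-th command.
        return {slot: phase2_accept(acceptors, n, v, slot) for slot, v in enumerate(values)}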
0 votes, 0 answers

What is the difference between RPC and collective communication?

As opposed to point-to-point communication, collectives allow for communication patterns across all processes in a group. What are the main differences between RPC and collective communication? How do they differ at the backend?
hsh • 111 • 2 • 8
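A small sketch of the two styles as they appear in torch.distributed, assuming the default process group and the RPC framework are already initialised: a collective such as all_reduce is executed symmetrically by every rank, whereas an RPC targets one named peer and runs a function there. The worker name and the add function are placeholders.

    import torch
    import torch.distributed as dist
    import torch.distributed.rpc as rpc

    def collective_example(rank):
        t = torch.ones(1) * rank
        dist.all_reduce(t)          # every rank participates; every rank ends up with the sum
        return t

    def add(a, b):
        return a + b

    def rpc_example():
        # point-to-point: one caller asks one named peer to run a function and return the result
        return rpc.rpc_sync("worker1", add, args=(torch.ones(1), torch.ones(1)))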
0 votes, 0 answers

How to use ZooKeeper to distribute work across a cluster of servers

I'm studying up for system design interviews and have run into this pattern in several different problems. Imagine I have a large volume of work that needs to be repeatedly processed at some cadence. For example, I have a large number of alert…
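One common sketch of this pattern uses the kazoo client: every server registers an ephemeral znode for membership, and each server claims the slice of work that hashes to its position in the sorted member list, so work rebalances automatically when a node joins or dies. The paths and the modulo partitioning are assumptions.

    import socket
    import zlib
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    me = socket.gethostname()
    # The ephemeral node disappears automatically if this server dies or disconnects.
    zk.create(f"/alert-workers/{me}", ephemeral=True, makepath=True)

    def my_share(alert_ids):
        members = sorted(zk.get_children("/alert-workers"))
        my_index = members.index(me)
        # Deterministic partitioning: alert a goes to member (crc32(a) mod cluster size).
        return [a for a in alert_ids
                if zlib.crc32(str(a).encode()) % len(members) == my_index]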
0 votes, 1 answer

torch.distributed.barrier() added on all processes not working

import torch import os torch.distributed.init_process_group(backend="nccl") local_rank = int(os.environ["LOCAL_RANK"]) if local_rank >0: torch.distributed.barrier() print(f"Entered process {local_rank}") if local_rank ==0: …
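For comparison, the usual "rank 0 does the one-off work first" pattern: barrier() is a collective, so every rank must reach a matching call; in the snippet above only the non-zero ranks ever call it, which is one common way such a script hangs. A sketch, assuming a torchrun launch and the NCCL backend:

    import os
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])

    if local_rank != 0:
        dist.barrier()                         # non-zero ranks wait here ...
    if local_rank == 0:
        print("rank 0 doing one-off setup")    # placeholder for rank-0-only work
        dist.barrier()                         # ... until rank 0 reaches its matching barrier
    print(f"Entered process {local_rank}")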
0 votes, 0 answers

Using Dart for desktop, distributed and cloud programming

I would like to use Dart for desktop, distributed and cloud programming. Will Google extend the Flutter framework to include libraries for desktop GUI, distributed, and cloud programming, in the same way as Java? To use Dart as a general-purpose…
0 votes, 1 answer

How to do a groupby of a Ray dataset using two keys?

Let's say I want to group by A and B and calculate the sum of Sales. How should I go about it? import pandas as pd import ray ray.init() rdf = ray.data.from_pandas(pd.DataFrame({'A':[1,2,3],'B':[1,1,4],'Sales':[20,30,40]})) I did try doing…
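One workaround sketch, assuming a Ray Data version whose groupby takes a single column name (newer releases may accept a list of keys directly): build a composite key column with map_batches, group on it, then sum Sales.

    import pandas as pd
    import ray

    ray.init()
    rdf = ray.data.from_pandas(
        pd.DataFrame({"A": [1, 2, 3], "B": [1, 1, 4], "Sales": [20, 30, 40]})
    )

    def add_key(batch: pd.DataFrame) -> pd.DataFrame:
        # Composite key standing in for a true multi-column groupby.
        batch["AB"] = batch["A"].astype(str) + "_" + batch["B"].astype(str)
        return batch

    result = (
        rdf.map_batches(add_key, batch_format="pandas")
           .groupby("AB")
           .sum("Sales")
    )
    print(result.to_pandas())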
0 votes, 1 answer

Does optuna.integration.TorchDistributedTrial support multinode optimization?

Does integration.TorchDistributedTrial support multinode optimization? I'm using Optuna on a SLURM cluster. Suppose I would like to do a distributed hyperparameter optimization using two nodes with two GPUs each. Would submitting a script like…
Siem • 72 • 9
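For reference, a sketch of the pattern from Optuna's distributed examples: rank 0 owns the study, and every other rank replays each trial and receives the suggested values through torch.distributed broadcasts, so it follows whatever process group the SLURM launch initialises. The metric and the trial count are placeholders.

    import optuna
    import torch.distributed as dist

    N_TRIALS = 20

    def objective(single_trial):
        # On rank 0 this wraps the real trial; on other ranks single_trial is None.
        trial = optuna.integration.TorchDistributedTrial(single_trial)
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        # ... build the DDP model, train, evaluate ...
        return 1.0 - lr                       # placeholder metric for the sketch

    def main():
        dist.init_process_group(backend="nccl")
        if dist.get_rank() == 0:
            study = optuna.create_study(direction="maximize")
            study.optimize(objective, n_trials=N_TRIALS)
        else:
            for _ in range(N_TRIALS):
                try:
                    objective(None)           # follow along with rank 0's trials
                except optuna.TrialPruned:
                    pass

    if __name__ == "__main__":
        main()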