Questions tagged [fault-tolerance]

Fault tolerance refers to a system's capability to isolate, compensate for, and recover from failure with minimal impact on the end user. When using this tag, also include tags indicating the system and/or technology you are working with.

305 questions
5
votes
4 answers

Testing fault tolerant code

I’m currently working on a server application where we have agreed to try and maintain a certain level of service. The level of service we want to guarantee is: if a request is accepted by the server and the server sends on an acknowledgement to the…
Robert
  • 6,407
  • 2
  • 34
  • 41
5
votes
2 answers

Articles about replication schemes/algorithms?

I'm designing a distributed system with a certain flow of data in it. I'd like to guarantee that at least N nodes have almost-current data at any given time. I do not need complete consistency, only eventual consistency (i.e. for any time instant,…
jkff
  • 17,623
  • 5
  • 53
  • 85
4
votes
1 answer

Bean Configuration for Circuit Breaker of Resilience4J Using Spring Boot

I want to move my Circuit Breaker configuration from the application.yml file to a Java config file as a bean declaration, because it makes the application.yml file too large. Will it be possible for me to remove the configuration from application.yml and…
Monesh
  • 103
  • 2
  • 8
4
votes
1 answer

How Kafka leader replica decides to advance Highwater Mark (HW) when replicating data to follower replicas

I read about the Kafka replication protocol and found that Kafka maintains LEO and HW. As I understand it, LEO is the offset of the latest message a replica has seen, and HW is the offset of the latest message that every replica is guaranteed to have seen. Kafka producer…
wmIbb
  • 125
  • 1
  • 4
  • 19
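The LEO/HW relationship in the Kafka excerpt above can be illustrated with a small model. This is only a sketch, not Kafka's implementation: the function name and the dictionary layout are hypothetical, and it assumes the commonly documented rule that the leader takes the minimum LEO over the in-sync replicas (rather than over every replica) and never moves the HW backwards.

```python
def advance_high_watermark(current_hw, leo_by_replica, in_sync_replicas):
    """Return the new high watermark for a partition (simplified model).

    leo_by_replica: maps a replica id to its log end offset (LEO).
    in_sync_replicas: ids of the replicas currently considered in sync,
    including the leader itself (an assumption of this sketch).
    """
    # The smallest LEO among in-sync replicas is the highest offset
    # that all of them are known to have reached.
    candidate = min(leo_by_replica[replica] for replica in in_sync_replicas)
    # The high watermark is monotonic in this model: it only moves forward.
    return max(current_hw, candidate)


leos = {"leader": 10, "follower-1": 8, "follower-2": 5}
new_hw = advance_high_watermark(4, leos, {"leader", "follower-1"})
```

In this model a lagging follower that has dropped out of the in-sync set ("follower-2" here) does not hold the high watermark back.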
4
votes
1 answer

What does ZooKeeper fault tolerance exactly mean? Simultaneously or accumulatively?

As mentioned in the ZooKeeper Getting Started Guide, a minimum of three servers are required for a fault tolerant clustered setup, and it is strongly recommended that you have an odd number of servers. So if I have 5 servers, and as mentioned above…
dukyz
  • 347
  • 1
  • 2
  • 12
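The arithmetic behind the ZooKeeper question above can be sketched as follows, assuming the usual majority-quorum rule (the ensemble stays available as long as more than half of its servers are up at the same time). The helper names are hypothetical and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers may be down at the same time while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare ensemble sizes to see why odd sizes are recommended.
summary = {n: tolerated_failures(n) for n in (3, 4, 5, 6)}
```

Under this rule an ensemble of 5 tolerates 2 servers being down at once, and an ensemble of 6 tolerates no more than that, which is why an even-sized ensemble adds cost without adding tolerance.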
4
votes
2 answers

Handle child lambda failures

We are trying out Lambda for our ETL job, which is written in Clojure. In our architecture, the scheduler triggers the parent lambda, and the parent lambda then triggers 100 child lambdas and a counter lambda. The child lambdas after completion of their…
SANN3
  • 9,459
  • 6
  • 61
  • 97
4
votes
1 answer

Spark fault tolerance for wide dependencies

I'm interested in finding out how Spark implements fault tolerance. In their paper they describe how they do it for "narrow dependencies" like map, which is fairly straightforward. However, they do not state what they do if a node crashes after a…
Dezi
  • 172
  • 1
  • 12
4
votes
1 answer

elastic parallelism and fault-tolerance in distributed Julia

How does Julia expose fault tolerance when a node goes down (intentionally or not) and when communication between nodes goes down? I saw a few mentions of such a feature but could not find out exactly how it can be done.
GuSuku
  • 1,371
  • 1
  • 14
  • 30
4
votes
2 answers

Some questions around sockets and accept()

Let's say that we have created a socket with socket(), then used bind() and listen(). Then we use accept() to wait for client requests. After a client is connected, suppose we shut down the server (for example, we ctrl+c the process). Is the client…
Naoum Mandrelas
  • 258
  • 3
  • 20
4
votes
0 answers

Client side load balancing with redis-py

I have a redis setup with 1 master and 2 slaves on ElastiCache. Master failover is already handled, but I want to make sure that: reads are load-balanced across the three servers; writes only go to the master; should a read fail, we try again at another…
Temuz
  • 1,413
  • 4
  • 14
  • 23
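The three requirements in the redis question above (balanced reads, master-only writes, retry a failed read elsewhere) can be sketched without depending on any particular client. Everything here is an assumption for illustration: `ReadBalancer`, `op`, and `retry_on` are hypothetical names, and the nodes can be any objects. With redis-py one would presumably pass `redis.Redis` instances as nodes and `redis.exceptions.ConnectionError` in `retry_on`, since that exception is not the built-in `ConnectionError`.

```python
import itertools


class ReadBalancer:
    """Round-robin reads over all nodes, writes to the master only."""

    def __init__(self, master, replicas, retry_on=(ConnectionError,)):
        self.master = master
        self.readers = [master, *replicas]
        self.retry_on = tuple(retry_on)
        # Endless 0, 1, 2, 0, 1, ... sequence used to pick the first node.
        self._start = itertools.cycle(range(len(self.readers)))

    def write(self, op):
        # Writes never leave the master.
        return op(self.master)

    def read(self, op):
        start = next(self._start)
        last_error = None
        # Try every reader once, beginning at the round-robin position.
        for offset in range(len(self.readers)):
            node = self.readers[(start + offset) % len(self.readers)]
            try:
                return op(node)
            except self.retry_on as exc:
                last_error = exc
        # Every node failed: surface the most recent error.
        raise last_error
```

A read would then be issued as `balancer.read(lambda node: node.get("key"))`. The sketch does not track node health between calls, so a node that is down is retried on every rotation.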
4
votes
1 answer

Intended granularity of Hystrix commands?

I just read the Hystrix docs/wiki and still am missing something at a fundamental level: what is the intended level of granularity for a HystrixCommand impl? For instance, say I have a DAO object that handles CRUD operations for some DB entity, say,…
smeeb
  • 27,777
  • 57
  • 250
  • 447
4
votes
1 answer

How does the HP (Tandem) Non stop compare with Linux clusters?

HP NonStop systems (previously known as "Tandem") are known for their high availability and reliability, and higher price. How do Linux or Unix based clusters compare with them, in these respects and others?
Abhishek Yadav
  • 4,931
  • 4
  • 20
  • 10
4
votes
1 answer

When remote machine dies, MPI manager failed to detect it with MPI_Irecv call

I am writing a program to detect the sudden crash of a remote machine. The manager process runs on machine 1 and the worker process runs on machine 2. The manager process sends a message to the worker process by calling MPI_Isend. The remote worker…
4
votes
1 answer

Reconnection to Redis after reboot

I have a bunch of long-running processes that connect to a Redis server (using Jedis). Everything works fine as long as I don't reboot the machine running Redis or restart the Redis server. As soon as I reboot or restart, the connection is lost. Is there…
Soumya Simanta
  • 11,523
  • 24
  • 106
  • 161
4
votes
2 answers

Persisting Akka state in case of a crash

I am a beginner with Akka, and I enjoy many of the functionalities it provides for asynchronous programming, such as Actors, Agents or Futures. A strong selling point of Akka is the fact that when an actor crashes, an equivalent actor is created…
Andrea
  • 20,253
  • 23
  • 114
  • 183