Fault tolerance refers to a system's capability to isolate, compensate for, and recover from failure with minimal impact on the end user. When using this tag, also include tags for the system and/or technology you are working with (as additional supporting metadata).
Questions tagged [fault-tolerance]
305 questions
3 votes, 2 answers
How to design: avoid resource leaks when randomly accessing files
I have a client/server application where the client app opens files. Those files get split into chunks and sent to the server.
Not only does the client send file chunks, but it sends other data as well. Each message (data or file chunk) has a…

Charles V.G.
3 votes, 2 answers
What is the real benefit of Erlang's fault tolerance for a web project?
Let's assume we have a web project in which we want to have ~10000 web clients connected to the server simultaneously. Let's also assume that one client session lasts about 25 minutes.
If we compare the LAMP stack or any other popular web…

skanatek
3 votes, 1 answer
If a node of a DHT fails, will the values become unavailable?
I'm reading up on DHTs, but I'm struggling to find information on what happens to DHT values when a node fails.
As far as I understand, without redundancy of data (hash table values) the failure of a single node would simply make the…

creativecoding
3 votes, 3 answers
Fault Tolerance in MapReduce
I was reading about Hadoop and how fault tolerant it is. I read about HDFS and how failures of master and slave nodes can be handled. However, I couldn't find any document that describes how MapReduce handles fault tolerance. Particularly, what…

Chander Shivdasani
3 votes, 2 answers
How to run reliable, scalable Redis on Kubernetes
I have been searching a lot for how to deploy Redis with high availability on Kubernetes.
I have some problems using Redis Cluster mode,
and when using master-slave mode we also need to deploy Sentinel to handle master failures.
I have…

ElGenius
3 votes, 0 answers
Bulk Unload from Redshift to S3 Interrupted
I wrote a Python script that does a bulk unload of all tables within a schema to S3, which scales to petabytes of data. While the script was running without problems, it was interrupted by a network disconnection.
Now, I'm in the…

Praneeth Turlapati
3 votes, 1 answer
Achieve Fault Tolerance with Consul Cluster
I have created a Consul server cluster using different ports on localhost.
I used the commands below for that.
server 1:
consul agent -server -bootstrap-expect=3 -data-dir=consul-data -ui -bind=127.0.0.1 -dns-port=8601 -http-port=8501 -serf-lan-port=8303…

Ishara Madhawa
3 votes, 1 answer
Python ZeroMQ broadcasting messages
I am going to implement Practical Byzantine Fault Tolerance (PBFT).
Hence, I am going to have multiple processes; P0 is going to initialize a round by sending a first message.
Is it possible to broadcast a message to all other processes using…

dilot
3 votes, 1 answer
Fault Tolerance of FlinkKafkaConsumer in HiBench
I am running some experiments to test the fault tolerance capabilities of Apache Flink. I am currently using the HiBench framework with the WordCount micro benchmark implemented for Flink.
I noticed that if I kill a TaskManager during an execution,…

Valerio
3 votes, 1 answer
Do we need PBFT algorithm support in permissioned blockchain networks?
I am new to BCT. My question is: why do we need a consensus algorithm such as PBFT in a permissioned blockchain network where the nodes are trusted? Is it only to cope with node failures, or is there another use case? Can anyone…

Satya Narayana
3 votes, 4 answers
How does HP/Tandem NonStop achieve single failure FT without spares?
As far as I could gather from Wikipedia and the mind-boggling HPE website, the claim to fame of the NonStop system architecture is that it can achieve single-failure FT without having to allocate excessive amounts of spare capacity (i.e. in…

ddimitrov
3 votes, 0 answers
Why should an HDFS cluster not be stretched across DCs?
It's easy to find well-regarded references stating that HDFS should not be stretched across data centers [1], while Kafka should be stretched [2].
What specific issues make HDFS ill-suited to being stretched?
I'm considering stretching HDFS across…

Paul Carey
3 votes, 3 answers
What is the purpose of stopping actors in Akka?
I have read the Akka docs on fault tolerance & supervision, and I think I totally get them, with one big exception (no pun intended).
Why would you ever want or need to stop a child actor?
The only clue in the docs is:
Closer to the Erlang way is…

smeeb
3 votes, 2 answers
Akka + WithinTimeRange
I've been testing Akka's fault-tolerance system, and so far it has worked well at retrying to send a message according to the specified maxNrOfRetries.
However, it does not restart the actor within the given time range; it restarts all at once,…

Thiago Pereira
3 votes, 2 answers
Mitigating Hadoop's Achilles tendons
I just read this Hadoop tutorial, which states that Hadoop has an Achilles' tendon (a single point of failure) in the JobTracker:
The JobTracker is a single point of failure for the Hadoop MapReduce service which means if JobTracker goes down, all…

smeeb