Fault tolerance refers to a system's capability to isolate, compensate for, and recover from failure with minimal impact on the end user. When using this tag, also include tags for the system and/or technology you are working with, as supporting metadata.
Questions tagged [fault-tolerance]
305 questions
1
vote
1 answer
NServiceBus appropriate for load distribution of periodic tasks
Would NServiceBus or an equivalent ESB be appropriate for an application that has a bunch of different kinds of background maintenance-type tasks? For example:
Scanning databases for the occurrence of certain words in user-generated…

jlew
1
vote
1 answer
tf.train.MonitoredTrainingSession arguments
What arguments does config=None take in tf.train.MonitoredTrainingSession? How can I specify the master node (e.g. at localhost:2222) with the proper syntax?
Below is the error I encounter when I use config = 'grpc://localhost:2222'…

itsamineral
1
vote
1 answer
tensorflow monitoredsession usage
I have the following code to perform simple arithmetic calculations. I am trying to implement fault tolerance in it by using a Monitored Training session.
import tensorflow as tf
global_step_tensor = tf.Variable(10, trainable=False,…

itsamineral
1
vote
1 answer
Erlang simple_one_for_one supervisor does not restart child
I have a test module and a simple_one_for_one supervisor.
test.erl
-module(test).
-export([
    run/1,
    do_job/1
]).

run(Fun) ->
    test_sup:start_child([Fun]).

do_job(Fun) ->
    Pid = spawn(Fun),
    io:format("started ~p~n", [Pid]),
…

Amin
1
vote
2 answers
Erlang supervisor does not restart child
I'm trying to learn about Erlang supervisors. I have a simple printer process that prints hello every 3 seconds. I also have a supervisor that must restart the printer process if any exception occurs.
Here is my…

Amin
1
vote
1 answer
Committee change in PBFT
I'm implementing a distributed system using Practical Byzantine Fault Tolerance. This method entrusts a committee to vote on each commit. However, if all committee members crash or come under a DDoS attack, the entire network breaks down. I'm curious if…

Yangrui
1
vote
1 answer
How to prevent repeated processing of failed message by ask pattern in Akka?
Original description is updated after some investigation:
When I send a message to an actor via the Ask pattern, and the actor fails with an exception, the message is processed again.
The exact number of retries varies, and I was not able to…

Roman
1
vote
1 answer
How to get the exact point of execution in a Java application?
In a running Java application, I want to get the exact point of execution, i.e. the line of code currently running.
I'm researching some fault tolerance approaches and trying to implement some solutions. I'm serializing a Thread object to a file and forcing an…

Pergentino Araújo
1
vote
2 answers
Python requests exception handling: bad except clause order
I'm writing a fault-tolerant HTTP client with the requests library and I want to handle all the exceptions that are defined in requests.exceptions.
Here are the exceptions that are defined within requests.exceptions:
'''
exceptions.BaseHTTPError …

Ricky Wilson
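A "bad except clause order" problem usually means a base exception class is listed before its subclasses, so the more specific handlers can never run. The sketch below illustrates the rule with a small stand-in hierarchy; the class names `RequestFailure`, `ConnectionFailure` and `TimeoutFailure` are hypothetical and only mimic the general shape of requests.exceptions, where RequestException is the common base class. It is not taken from the question.

```python
# Stand-in hierarchy (hypothetical names, not the requests library itself).
class RequestFailure(Exception):
    """Base class for every failure in this sketch."""


class ConnectionFailure(RequestFailure):
    """The server could not be reached."""


class TimeoutFailure(RequestFailure):
    """The server did not answer in time."""


def classify(exc):
    """Return a label for exc, trying specific classes before the base class."""
    try:
        raise exc
    except TimeoutFailure:
        return "timeout"
    except ConnectionFailure:
        return "connection"
    except RequestFailure:
        # The base class goes last; listed first, it would shadow the others.
        return "other request failure"


def classify_bad_order(exc):
    """Same handlers with the base class first: the later clause never runs."""
    try:
        raise exc
    except RequestFailure:
        return "other request failure"
    except TimeoutFailure:  # unreachable, already caught by the clause above
        return "timeout"


good = classify(TimeoutFailure())
bad = classify_bad_order(TimeoutFailure())
```

With the handlers ordered from most specific to most general, `good` is `"timeout"`, while the misordered version reports the same exception as `"other request failure"`.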
1
vote
1 answer
Sending Java Objects over replica tcp sockets
I want to transfer Java POJOs via TCP.
Let A and B be the participants, and C1 be the main connection between them and C2 be another connection to be used if C1 fails.
I have two kinds of objects: reliable and non-reliable.
When C1 disconnects, each…

user706071
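The primary/backup arrangement described here (C1 as the main connection, C2 as the fallback) can be sketched roughly as follows. This is a minimal illustration in Python rather than Java, and it assumes each connection is represented by a callable that raises `OSError` when the link is down; `FailoverSender`, `broken_c1` and `delivered` are hypothetical names. It does not address the reliable versus non-reliable distinction or messages in flight at the moment C1 drops.

```python
class FailoverSender:
    """Try each channel in order and stop at the first one that accepts the payload."""

    def __init__(self, primary, backup):
        # Each channel is a callable taking the payload; it raises OSError on failure.
        self.channels = [primary, backup]

    def send(self, payload):
        last_error = None
        for channel in self.channels:
            try:
                channel(payload)
                return channel  # report which channel delivered the payload
            except OSError as err:
                last_error = err  # remember the failure and try the next channel
        raise ConnectionError("all channels failed") from last_error


def broken_c1(payload):
    """Stand-in for C1 after it has disconnected."""
    raise ConnectionResetError("C1 is down")


delivered = []  # stand-in for C2: simply records what it receives
sender = FailoverSender(broken_c1, delivered.append)
sender.send(b"serialized-object")
```

After the call, the payload has been recorded by the backup channel only. A real implementation would also need to decide whether to keep using C2 or probe C1 for recovery.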
1
vote
0 answers
How does fault tolerance work in a distributed system?
I didn't have the privilege to take a course on distributed systems. I am reading up on distributed systems and came to know about replication etc.
Can you tell me which strategy is the most popular/most used for handling fault tolerance or does it…

rents
1
vote
0 answers
How to handle tell/ask failure of Akka peers?
Java API here. Brand new to Akka, and trying to understand how its Fault Tolerance model applies to actor messaging that falls outside the parent/child or supervisor/supervisee pattern.
If my understanding of Akka is correct, one actor can…

smeeb
1
vote
0 answers
Configure Hadoop to tolerate server failures
I am trying to configure a 50-node Hadoop 2.6.0 cluster for failure tolerance. Specifically, I'd like to be able to suddenly stop 5 servers and still have my job complete. So far, stopping even 1 server causes my job to fail with too many map…

tix
1
vote
0 answers
Fault-tolerance in Apache Sqoop
I want to run an incremental nightly job that extracts hundreds of GB of data from an Oracle data warehouse into HDFS. After processing, the results (a few GB) need to be exported back to Oracle.
We are running Hadoop in Amazon AWS, and our Data Warehouse is…

Raju Rama Krishna
1
vote
2 answers
What happens if an MPI process crashes?
I am evaluating different multiprocessing libraries for a fault-tolerant application. I basically need any process to be allowed to crash without stopping the whole application.
I can do it using the fork() system call. The limit here is that the…

Pietro
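The process-level isolation that the question attributes to fork() can be illustrated with ordinary child processes: a worker that dies abruptly only produces a non-zero exit code for the parent to inspect. The sketch below is in Python and uses the standard `subprocess` module; it says nothing about how any MPI implementation reacts to a crashed rank, and `run_isolated` is a hypothetical helper name.

```python
import subprocess
import sys


def run_isolated(snippet):
    """Run a Python snippet in a separate process and return its exit code."""
    return subprocess.run([sys.executable, "-c", snippet]).returncode


# The first worker exits abruptly with status 3; the second finishes normally.
codes = [
    run_isolated("import os; os._exit(3)"),
    run_isolated("pass"),
]

# The parent is still running here and can decide to restart failed workers.
failed = [index for index, code in enumerate(codes) if code != 0]
```

The parent keeps running after the first worker fails, and `failed` lists the indices of workers that might be restarted.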