Fault tolerance refers to a system's capability to isolate, compensate for, and recover from failure with minimal impact on the end user. When using this tag, also include tags indicating the system and/or technology you are working with (as additional supporting metadata).
Questions tagged [fault-tolerance]
305 questions
1 vote, 1 answer
Spark 2.4.0 Structured Streaming Fault Tolerance from Kafka
I have some questions about fault tolerance in Spark Structured Streaming when reading from Kafka. This is from the Structured Streaming Programming Guide:
In case of a failure or intentional shutdown, you can recover the previous progress…

Panagiotis Fytas
1 vote, 1 answer
How to avoid loss of internal state of a master during fail-over to new master during a network partition
I was trying to implement a simple system with a single master node and multiple backup nodes, to learn about distributed and fault-tolerant architecture.
Currently this is what my system looks like:
N different nodes, each one identical. 1 master node…

Vikrant Biswas
1 vote, 4 answers
Temporarily suspend: Azure Service Bus Message Queue
We are using an Azure Service Bus message queue to process actions that are performed against a third-party API. The issue we are having is that the third-party API is down; what we want to do is suspend the queue temporarily so we can hold messages till the third…

Renu Saini
1 vote, 1 answer
WLP's MicroProfile Fault Tolerance bulkhead implementation not kicking in
I am trying to test MicroProfile Fault Tolerance in WebSphere Liberty (WebSphere Application Server 18.0.0.3/wlp-1.0.22.cl180320180905-2337) on Java HotSpot(TM) 64-Bit Server VM, version 1.8.0_161-b12 (en_US), but I cannot get the bulkhead logic to…

user2299548
1 vote, 0 answers
Hyperledger Fabric - crash restore strategies
Yesterday I was faced with a nice problem: nothing happens when a chaincode container crashes or someone stops it manually.
Sample network (using v1.2.0 images):
2 ORGs
2 CAs
2 peers ORG1 (using LevelDB as a storage)
2 peers ORG2 (using LevelDB as…

rusbro
1 vote, 1 answer
How does each backup/node get 2f replies in PBFT?
In Practical Byzantine Fault Tolerance (PBFT), the reason 3f+1 nodes are needed, as I understand it, is to allow for the worst-case scenario where:
1. f+1 nodes are normal
2. f nodes are unresponsive
3. f nodes are faulty
So in the PREPARE phase,…
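
The counting behind the 3f+1 bound can be written out as a minimal sketch. It assumes exactly n = 3f + 1 replicas; the function name `pbft_sizes` and the dictionary keys are made up for illustration and do not come from any PBFT library.

```python
def pbft_sizes(f: int) -> dict:
    """Sizes used by PBFT when tolerating f Byzantine replicas (assumes n = 3f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    n = 3 * f + 1               # total replicas
    quorum = 2 * f + 1          # replicas whose agreement is required
    overlap = 2 * quorum - n    # fewest replicas any two quorums share (= f + 1)
    prepares_needed = 2 * f     # matching PREPARE messages from other replicas,
                                # collected in addition to the PRE-PREPARE itself
    return {
        "n": n,
        "quorum": quorum,
        "overlap": overlap,
        "prepares_needed": prepares_needed,
    }


for f in (1, 2, 3):
    print(f, pbft_sizes(f))
```

Because any two quorums share at least f + 1 replicas, at least one replica in the overlap is non-faulty, which is what prevents two conflicting decisions from both gathering a quorum.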

Bosen
1 vote, 0 answers
Integration testing with TomEE embedded and MicroProfile Fault Tolerance
I need to test some components in a Java EE environment that use annotations from the MicroProfile project, i.e. @Asynchronous and @Timeout from the fault tolerance part of the project. The implementation library for fault tolerance is Apache Safeguard.
In…

Znas Me
1 vote, 2 answers
How does Elasticsearch recover from a quorum that is not unanimous
When using replication with a quorum, Elasticsearch allows writes to fail for some (a small number of) replica shards. Writing to a replica might fail only because it is temporarily unavailable (because of a temporary network partition, for…

Raedwald
1 vote, 1 answer
How is PBFT applied in blockchain?
I am trying to understand how PBFT (Practical Byzantine Fault Tolerance) is applied in blockchain. After reading the paper, I found that the process for PBFT to reach consensus is as follows:
A client sends a request to invoke a service operation to the…

Frank Kong
1 vote, 3 answers
Is there a way to have a block of code executed atomically? (language does not matter)
I'm reading some papers on distributed systems. The authors claim to be able to have a sequence of operations executed atomically (either all operations are executed successfully or none is executed, even when system failures occur). I wonder how…
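
One common way to get all-or-nothing behaviour is a database transaction. The sketch below is only an illustration, not the method of the papers mentioned: it uses Python's standard `sqlite3` module, and the `accounts` table, the `transfer` helper, and the `fail_midway` flag are hypothetical names chosen for the example. It simulates a failure with an exception; surviving a real process or machine crash depends on the database's own journaling.

```python
import sqlite3


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst so that both updates apply or neither does."""
    # Used as a context manager, the connection commits if the block
    # finishes normally and rolls back if an exception escapes it.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        if fail_midway:
            raise RuntimeError("simulated failure between the two updates")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    transfer(conn, "alice", "bob", 30, fail_midway=True)
except RuntimeError:
    pass
after_failure = balances(conn)   # the partial debit was rolled back

transfer(conn, "alice", "bob", 30)
after_success = balances(conn)   # both updates were committed together
```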

Burgess Chen
1 vote, 1 answer
VMware Fault Tolerance possible Tests
I have been thinking about how I can test my Fault Tolerance machines,
but I can't seem to come up with a proper test.
How can I possibly calculate the time it took for VMware to switch from the primary virtual machine to the secondary one?

Youssef Sakuragi
1 vote, 2 answers
High Availability (HA) vs Fault Tolerance
I read a couple of articles on Google like this one, but I am still not clear about the difference between them.
The purpose of both seems to be to keep providing the service when one component fails (be it hardware or software): a backup/secondary component takes over…

scott miles
1 vote, 2 answers
Why does Apache Spark not resubmit failed tasks?
I want to simulate fault-tolerance behavior. I wrote a "hard" function that fails from time to time. For example:
def myMap(v: String) = {
// print task info and return "Ok" or throw exception
val context = TaskContext.get()
val r =…
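
The idea of a function that fails from time to time and is re-run until it succeeds can be sketched in plain Python. This is not Spark's scheduler and not the asker's code: `run_with_retries`, `flaky`, and the default of four attempts are hypothetical choices made for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[int], T], max_attempts: int = 4) -> T:
    """Call task(attempt) until it returns, giving up after max_attempts failures."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception as err:  # any failure counts as a failed attempt
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky(attempt: int) -> str:
    """Fail on the first two attempts, then succeed."""
    if attempt < 3:
        raise ValueError(f"simulated failure on attempt {attempt}")
    return "Ok"


result = run_with_retries(flaky)  # succeeds on the third attempt
```

Making the failure depend on the attempt number, rather than on a random draw, keeps the simulation reproducible.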

Adel Chepkunov
1 vote, 1 answer
VMware FT Logging
I am a newbie to VMware. While working on the standard switch, I came across FT Logging. I did not find any good source on it. Can someone please explain where we use FT Logging and under which conditions we need to use it? What is the use of…

ashok
1 vote, 1 answer
The impact of correlated failures on cluster performance
In several presentations (e.g., 1, 2, 3) on cluster management, one of the scheduler's objectives is to reduce coordinated failures by distributing the tasks of a single job across computing nodes that are less likely to fail together.
Why are correlated…

max