Questions tagged [fault-tolerance]

Fault tolerance refers to a system's capability to isolate, compensate for, and recover from failure with minimal impact on the end user. When using this tag, also include tags indicating the system and/or technology you are working with (as additional supporting metadata).

305 questions
1
vote
1 answer

Spark 2.4.0 Structured Streaming Fault Tolerance from Kafka

I have some questions about fault tolerance in Spark Structured Streaming when reading from Kafka. This is from the Structured Streaming Programming Guide: In case of a failure or intentional shutdown, you can recover the previous progress…
1
vote
1 answer

How to avoid loss of internal state of a master during fail-over to new master during a network partition

I was trying to implement a simple system with a single master node and multiple backup nodes to learn about distributed and fault-tolerant architecture. Currently this is what my system looks like: N different nodes, each one identical. 1 master node…
1
vote
4 answers

Temporarily suspend an Azure Service Bus message queue

We are using an Azure Service Bus message queue to process some actions that are performed on a third-party API. The issue we are having is that the third-party API is down; what we want to do is suspend the queue temporarily so we can hold messages till the third…
Renu Saini
1
vote
1 answer

WLP's MicroProfile Fault Tolerance bulkhead implementation not kicking in

Trying to test MicroProfile Fault Tolerance in WebSphere Liberty (WebSphere Application Server 18.0.0.3/wlp-1.0.22.cl180320180905-2337) on Java HotSpot(TM) 64-Bit Server VM, version 1.8.0_161-b12 (en_US), but I cannot get the bulkhead logic to…
1
vote
0 answers

Hyperledger Fabric - crash restore strategies

Yesterday I faced a nice problem: nothing happens when a chaincode container crashes or someone stops it manually. Sample network (using v1.2.0 images): 2 ORGs, 2 CAs, 2 peers in ORG1 (using LevelDB as storage), 2 peers in ORG2 (using LevelDB as…
rusbro
1
vote
1 answer

How does each backup/nodes get 2f replies in PBFT?

In Practical Byzantine Fault Tolerance (PBFT), the reason 3f+1 nodes are needed, as I understand it, is to allow for the worst-case scenario where: 1. f+1 nodes are normal 2. f nodes are unresponsive 3. f nodes are faulty. So in the PREPARE phase,…
Bosen
1
vote
0 answers

Integration testing with TomEE embedded and MicroProfile Fault Tolerance

I need to test some components in a Java EE environment that use annotations from the MicroProfile project, i.e. @Asynchronous and @Timeout from the fault tolerance part of the project. The implementation library for fault tolerance is Apache Safeguard. In…
1
vote
2 answers

How does Elasticsearch recover from a quorum that is not unanimous

When using replication with a quorum, Elasticsearch allows writes to fail for some (a small number of) replica shards. Writing to a replica might fail only because it is temporarily unavailable (because of a temporary network partition, for…
Raedwald
1
vote
1 answer

How is PBFT applied in blockchain?

I am trying to understand how PBFT (Practical Byzantine Fault Tolerance) is applied in blockchain. After reading the paper, I found that the process for PBFT to reach consensus is as follows: A client sends a request to invoke a service operation to the…
Frank Kong
1
vote
3 answers

Is there a way to have a block of code executed atomically? (language does not matter)

I'm reading some papers on distributed systems. The authors claim to be able to have a sequence of operations executed atomically (either all operations are executed successfully or none is executed, even when system failures occur). I wonder how…
1
vote
1 answer

VMware Fault Tolerance possible Tests

I have been thinking about how I can test my Fault Tolerance machines, but I can't seem to come up with a proper test. How can I calculate the time it takes for VMware to switch from the primary virtual machine to the secondary one?
1
vote
2 answers

High Availability (HA) vs Fault Tolerance

I have read a couple of articles found through Google, like this one, but I am still not clear about the difference between them. The purpose of both seems to be to provide service when one component fails (be it hardware or software): a backup/secondary component takes over…
scott miles
1
vote
2 answers

Why does Apache Spark not resubmit failed tasks?

I want to simulate fault-tolerance behavior. I wrote a "hard" function that fails from time to time. For example: def myMap(v: String) = { // print task info and return "Ok" or throw exception val context = TaskContext.get() val r =…
1
vote
1 answer

VMware FT Logging

I am a newbie to VMware. While working on the standard switch, I came across FT Logging. I did not find any good source. Can someone please explain where we use FT Logging and under which conditions we need to use it? What is the use of…
ashok
1
vote
1 answer

The impact of correlated failures on cluster performance

In several presentations (e.g., 1, 2, 3) on cluster management, one of the scheduler's objectives is to reduce coordinated failures by distributing the tasks of a single job across computing nodes that are less likely to fail together. Why are correlated…
max