Questions tagged [fault-tolerance]

Fault tolerance is a system's ability to isolate, compensate for, and recover from failures with minimal impact on the end user. When using this tag, also include tags for the system or technology you are working with, as supporting metadata.

305 questions
0 votes, 1 answer

Example of getting C++ call stack on Windows

Can someone give an example of how to programmatically get the call stack of the currently running C++ program on Windows? From some threads (e.g. print call stack in C or C++) I got a suggestion to use DbgHelp. However, the library seems quite…
Serge Rogatch
0 votes, 1 answer

Working with multiprocessing using map() in Python code on big data

I'm trying to get some values (which I get using an extract function) from URLs stored in data.file; there are about 3,000,000 URLs in the file. Here is my code snippet: from multiprocessing import Pool p = Pool(10) revenuelist =…
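
One way to keep a long `Pool` run from failing on a single bad URL can be sketched as follows. This is a minimal sketch under stated assumptions: the `extract` body is a hypothetical stand-in (the question does not show it), and `safe_extract`, `read_urls`, and `process` are illustrative names, not part of the asker's code.

```python
from multiprocessing import Pool


def extract(url):
    """Hypothetical stand-in for the question's extract(); the real body is not shown."""
    if not url.startswith("http"):
        raise ValueError("unsupported URL: " + url)
    return len(url)


def safe_extract(url):
    """Run extract() and return (url, value, error) instead of raising.

    Catching inside the worker keeps one bad URL from aborting the whole map.
    """
    try:
        return url, extract(url), None
    except Exception as exc:
        return url, None, repr(exc)


def read_urls(path):
    """Yield URLs lazily so the full list is never held in memory at once."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def process(urls, workers=10, chunksize=1000):
    """Split results into successes and failures; failures can be retried later."""
    revenuelist, failed = [], []
    with Pool(workers) as pool:
        # imap_unordered streams results back as workers finish each chunk.
        for url, value, error in pool.imap_unordered(safe_extract, urls, chunksize):
            if error is None:
                revenuelist.append(value)
            else:
                failed.append((url, error))
    return revenuelist, failed
```

A caller would typically run `process(read_urls("data.file"))` under an `if __name__ == "__main__":` guard, then write the `failed` list to disk so a later pass can retry only those URLs.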
0 votes, 1 answer

Passive Replication in Distributed Systems - Replacing the Primary Server

In a passive-replication-based distributed system, if the primary server fails, one of the backups is promoted to primary. However, suppose that the original primary server recovers; how do we then switch the primary role back to it from the…
singhuist
0 votes, 1 answer

How can fault tolerance be implemented using Azure Traffic Manager?

Say we have two Azure-hosted sites, one in Asia (ap.test.com) and one in Europe (eu.test.com), which are load balanced via Azure Traffic Manager. As this works at the DNS level, the user is connected directly to, say, my Asia website (for example, because of low latency).…
0 votes, 2 answers

Website/webserver fault tolerance - the best practices

For example, I have two servers in the same network, with identical code/software. If the primary server goes down, I want the second one to become primary. I heard about the following approaches: Run a proxy server (nginx/haproxy/etc.) in front of…
Oleg
0 votes, 2 answers

Does Apache Helix support partition split and merge?

I understand that Apache Helix allows dynamic cluster expansion and shrinkage (e.g., adding, failing, or removing physical nodes). However, in the case that a single physical node cannot handle a single partition replica, I need to split a partition into…
0 votes, 1 answer

Spark Streaming fault tolerance on DStream batches

Suppose a stream is received at time X and my batch duration is 1 minute. My executors are processing the first batch, but this execution takes 3 minutes, until X+3, while at X+1 and X+2 we receive two more batches. Does that mean that at…
Vijay Krishna
0 votes, 0 answers

Spark - how to keep data integrity when appending files to an existing folder

In my organization we have an application that receives events and stores them on S3, partitioned by day. Some of the events are offline, which means that while writing we append the files to the proper folder (according to the date of the offline event). We…
Tal Joffe
0 votes, 1 answer

How can I reliably process thousands of HTTP requests when some may error?

I have run into this problem before for a few HTTP transactions (like a hundred or so posts). Today I'm trying to do 7k HTTP requests. This seems silly but it's the only way to interact with the target system. The best I've been able to do will…
jcollum
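
A common approach to this kind of batch is to retry each request with exponential backoff and to collect failures rather than stop at the first one. The sketch below is in Python (the question does not name a language) and makes no assumption about the HTTP client: each job is a zero-argument callable that performs one request. The names `call_with_retries` and `run_batch` are illustrative only.

```python
import random
import time


def call_with_retries(send, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds or attempts run out.

    Waits base_delay * 2**n plus a little jitter between tries.
    Returns (result, None) on success or (None, last_exception) on failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return send(), None
        except Exception as exc:  # narrow this to the client's real error types
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return None, last_error


def run_batch(jobs, **retry_options):
    """Run every job, collecting failures instead of stopping at the first one.

    jobs maps a job id to a zero-argument callable that performs one request.
    """
    done, failed = {}, {}
    for job_id, send in jobs.items():
        result, error = call_with_retries(send, **retry_options)
        if error is None:
            done[job_id] = result
        else:
            failed[job_id] = error
    return done, failed
```

Persisting the keys of `done` between runs would let an interrupted batch resume without resending requests that already succeeded, which matters if the target system's operations are not idempotent.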
0 votes, 1 answer

Does ZooKeeper survive the failure of one node in a cluster of three nodes?

I saw a similar question at ZooKeeper instances in Kafka, but it remained unanswered. So here is my extended version of the question (with more details). Environment: there are 3 nodes of a business application. Each application contains its…
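
The arithmetic behind this question can be sketched in a few lines of Python. The sketch assumes ZooKeeper's default majority quorum, with every server a voting participant (no observers, no hierarchical or weighted quorums); the function names are illustrative.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size, live_servers):
    """True if the live servers can still elect a leader and serve writes."""
    return live_servers >= quorum_size(ensemble_size)
```

Under these assumptions a three-server ensemble needs two live servers, so it keeps working after one failure but not after two. A four-server ensemble also tolerates only one failure, which is why odd ensemble sizes are usually preferred.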
0 votes, 0 answers

Fault tolerance / Error handling by spark-submit

I have a spark job which I'm running using the following command: sudo ./bin/spark-submit --jars lib/spark-streaming-kafka-assembly_2.10-1.4.1.jar \ --packages TargetHolding:pyspark-cassandra:0.2.4…
HackCode
0 votes, 1 answer

Storm's fault tolerance: is the data lost when a worker dies?

I have a question about fault tolerance. Considering the word counting topology you have given, the bolt "WordCount" may have many tasks, and "fieldsGrouping" is used to ensure the same word is always assigned to the same task. My question…
Weizhou He
0 votes, 1 answer

Sandboxing threads without separate processes

In the interest of ease of programming (local function calls instead of IPC) and performance (e.g. avoiding copies of large buffers), I'd like to have a Java VM call native code using JNI instead of through interprocess communication. There would be…
Yale Zhang
0 votes, 1 answer

VMware Fault Tolerance benchmarks

We are looking at deploying high-availability VMs using VMware Fault Tolerance. Has anyone got any real-world benchmarks? I'm only after the relative performance of running a VM conventionally in ESX vs. running it in FT mode. I assume there must be some…
0 votes, 2 answers

Create a fault tolerance example with DynamoDB Streams

I have been looking at DynamoDB to create something close to a transaction. I was watching this video presentation: https://www.youtube.com/watch?v=KmHGrONoif4 in which the speaker shows, around the 30-minute mark, ways to make DynamoDB operation…