Questions tagged [fault-tolerance]

Fault tolerance refers to a system's capability to isolate, compensate for, and recover from failure with minimal impact on the end user. When using this tag, include tags indicating the system and/or technology you are working with (as additional supporting metadata).

305 questions
1
vote
0 answers

Is it possible to ignore failed tasks in Spark

I have some large datasets where some records cause a UDF to crash. Once such a record is processed, the task fails, which causes the job to fail. The problems here are native (we use a native Fortran library with JNA), so I cannot catch them in the…
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
1
vote
2 answers

Microservices: how to track fallen down services?

Problem: Suppose there are two services, A and B. Service A makes an API call to service B. After a while, service A goes down or is lost due to network errors. How will other services detect that an outbound call from service A is lost / never…
1
vote
1 answer

Why is an exception in Spring Batch AsyncItemProcessor caught by SkipListener's onSkipInWrite method?

I'm writing a Spring Boot application that starts up, gathers and converts millions of database entries into a new streamlined JSON format, and then sends them all to a GCP PubSub topic. I'm attempting to use Spring Batch for this, but I'm running…
SnoopDougg
  • 1,467
  • 2
  • 19
  • 35
1
vote
1 answer

Exactly-once: who is storing the historical data, flink or the data source

I know that Apache Flink supports exactly-once processing, which relies on the checkpoint mechanism and a resendable data source. As I understand it, if a Flink operator hits an error, it needs to run its last operation again,…
Yves
  • 11,597
  • 17
  • 83
  • 180
1
vote
1 answer

How to recover a critical python job from system failure

Is there any Python library that would provide (generic) job state journaling and recovery functionality? Here's my use case: (1) data is received to start a job, (2) the job starts processing, (3) the job finishes processing. I then want to be able to restart a job back…
Garrett Motzner
  • 3,021
  • 1
  • 13
  • 30
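The received / started / finished lifecycle described in this question can be sketched as an append-only journal file. This is a minimal illustration under stated assumptions, not a recommendation of any particular library: the `JobJournal` class, its method names, the state names, and the one-JSON-object-per-line file format are all hypothetical and do not come from the question.

```python
import json
import os
import tempfile
from pathlib import Path


class JobJournal:
    """Hypothetical append-only journal: one JSON object per state change."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, job_id, state, payload=None):
        """Append a state transition and force it to disk before returning."""
        entry = {"job_id": job_id, "state": state, "payload": payload}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def unfinished(self):
        """Replay the journal; return jobs whose last state is not 'finished'."""
        jobs = {}
        if self.path.exists():
            with self.path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        entry = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # skip a line torn by a crash mid-write
                    job = jobs.setdefault(
                        entry["job_id"], {"state": None, "payload": None}
                    )
                    job["state"] = entry["state"]
                    if entry.get("payload") is not None:
                        job["payload"] = entry["payload"]
        return {j: info for j, info in jobs.items() if info["state"] != "finished"}


# Demo: job "a" completes, job "b" is interrupted after it starts.
demo_dir = tempfile.mkdtemp()
journal = JobJournal(os.path.join(demo_dir, "jobs.journal"))
journal.record("a", "received", {"n": 1})
journal.record("a", "started")
journal.record("a", "finished")
journal.record("b", "received", {"n": 2})
journal.record("b", "started")

pending = journal.unfinished()
print(pending)
```

After a restart, anything returned by `unfinished()` could be re-run from its recorded payload. Whether re-running is safe depends on the job being idempotent, which this sketch does not address.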
1
vote
0 answers

AWS Lightsail instance overcoming system hang

Suppose I have a Lightsail instance running a Java backend. For whatever reason, this instance could potentially hang/freeze and require a reboot. What is the standard/recommended way of dealing with such cases? Basically, client…
1
vote
1 answer

Fault tolerance for Kafka Direct Stream does not work. Checkpoint directory does not exist

I am writing an app that reads data from a Kafka topic, and I can't achieve fault tolerance in the event of a driver failure. The application runs in a k8s cluster using spark-submit. When I run my application for the first time, everything goes well, but when…
1
vote
0 answers

Rest API server fault tolerance

I have an API implemented with Jersey. An external service makes HTTP requests to my endpoint and expects a result in the response. Currently, I am using the Asynchronous Server API to run all my business logic and then respond when…
1
vote
0 answers

nodejs server recovering from crash

I'm attempting to build resilient and fault-tolerant services. As a test, I'm intentionally causing one to crash with an undefined variable outside the try-catch block. I'm calling this service locally via the Node.js HTTP client. First, I intentionally…
AppDeveloper
  • 1,816
  • 7
  • 24
  • 49
1
vote
1 answer

How to use MicroProfile FaultTolerance in Liberty profile

I want to use the circuit breaker function of the MP FaultTolerance feature in my web application. I don't know how to check whether this function is working in my application. I also want to track the values of the MP Metrics added by MP Fault…
Joe
  • 33
  • 3
1
vote
1 answer

RabbitMQ Sequential Publish with ACK and NACKs(Reject & Nack handling) Synchronous

Console.WriteLine($"Publishing to Default EXG & queue: {result.QueueName}"); Todo : BUILD Retries, Republish and Acknowledgement Workflow. IBasicProperties messageProps = _channel.CreateBasicProperties(); …
1
vote
1 answer

Spring Cloud Gateway and fault tolerance

I was reading about Spring Cloud architecture and technologies (like Eureka and the Hystrix circuit breaker) used to protect your application from downtime caused by the failure of some of your microservices. All in all, Spring Cloud suggests using…
Pasha
  • 642
  • 6
  • 22
1
vote
1 answer

Why is typecasting required in the code below?

I have a snippet, shown below, for step processing. @Bean @JobScope public Step readStep(StepBuilderFactory sbf, ItemReader reader,ItemWriter writer, TaskExecutor taskExecutor, RefFileReaderComponentFactory componentFactory, String file){ …
1
vote
1 answer

Tolerance stack combination with 3 resistor values

I am trying to perform a simple tolerance stack circuit analysis using Python instead of Excel. Basically, say I have the resistor values below, where each is given as Minimum | Nominal | Maximum: R1 -> 5 | 10 | 15 R2…
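One way to enumerate every minimum/nominal/maximum combination for three resistors is `itertools.product`. This is a rough sketch only: the R1 values come from the question, but the R2 and R3 values are hypothetical placeholders (the excerpt is truncated), and summing the values as a series resistance is an assumed analysis, since the actual circuit is not shown.

```python
from itertools import product

# (minimum, nominal, maximum) for each resistor.
resistors = {
    "R1": (5, 10, 15),   # from the question
    "R2": (18, 20, 22),  # hypothetical placeholder values
    "R3": (27, 30, 33),  # hypothetical placeholder values
}


def stack_combinations(resistors):
    """Return every min/nominal/max combination as a dict of name -> value."""
    names = list(resistors)
    return [
        dict(zip(names, values))
        for values in product(*(resistors[name] for name in names))
    ]


combos = stack_combinations(resistors)

# Assumed analysis: total series resistance for each combination.
series_totals = [sum(combo.values()) for combo in combos]

print(len(combos))                            # 27 combinations (3 ** 3)
print(min(series_totals), max(series_totals)) # 50 70
```

Any other per-combination calculation (a divider ratio, for instance) can replace the series sum without changing the enumeration.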
1
vote
1 answer

is there a "fault-tolerant" or "loose" mode for @babel/parser?

I'm interested in using @babel/parser to parse a JavaScript source file which may or may not contain syntax errors. acorn-loose is a thing, and esprima can be passed a tolerant flag with a value of true; is there an equivalent for Babel 7?
Dan O
  • 6,022
  • 2
  • 32
  • 50