In many cases, retries with a backoff algorithm are used inside workers. If a controller calls a worker, the controller just wants the job done, and retries help mitigate temporary problems such as transient network issues.
The typical logic, when a worker is called to run a task, is (a code sketch follows the list):
1. Before making a request, the worker creates a counter C with an initial value of zero and reads a maximum-attempts value from configuration, e.g. 3.
2. The worker waits for C * some_delay, where some_delay is a manually configured interval (more on this later).
3. The worker makes the request.
4. If the request fails, the worker checks whether all attempts have been used up; if so, the failure is sent back to the controller; otherwise, C is incremented and the worker goes back to step 2.
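A minimal sketch of this loop in Python, assuming a `call_remote` callable that raises an exception on failure (the names and defaults here are illustrative, not part of any specific library):

```python
import time

def run_with_retries(call_remote, max_attempts=3, some_delay=1.0):
    """Retry with a linearly growing delay: wait C * some_delay before each attempt."""
    attempt = 0  # the counter C from the steps above
    while True:
        time.sleep(attempt * some_delay)  # 0 s before the very first attempt
        try:
            return call_remote()  # step 3: make the request
        except Exception:
            attempt += 1
            if attempt >= max_attempts:
                raise  # all attempts used up: the failure goes back to the controller
```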
The net effect is several calls to the failing resource, with the delay growing after each failure.
The delay constant (some_delay above) is picked based on the overall system architecture. How long can the controller wait? If the controller itself times out at some point (or the controller's customers time out), then the sum of all the intervals must be less than that timeout; otherwise there is no point in retrying, since the customers won't be able to get the results anyway.
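A quick back-of-the-envelope check for the linear schedule above (all numbers here are made up for the example):

```python
some_delay = 2.0           # seconds, illustrative
max_attempts = 3
controller_timeout = 10.0  # seconds the controller (or its customer) will wait

total_backoff = sum(c * some_delay for c in range(max_attempts))
print(total_backoff)  # 0 + 2 + 4 = 6 seconds of waiting, before the requests themselves
assert total_backoff < controller_timeout, "retrying past the timeout is pointless"
```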
One more topic to consider is the thread management approach in your application. While a worker waits for the next retry, its thread is busy sleeping, which may or may not be a problem.
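If blocking a thread during the wait is a concern, the same loop can be written with asyncio so the delay yields control instead of holding a thread (a sketch, assuming an async `call_remote`):

```python
import asyncio

async def run_with_retries_async(call_remote, max_attempts=3, some_delay=1.0):
    attempt = 0
    while True:
        await asyncio.sleep(attempt * some_delay)  # yields the event loop, no thread is blocked
        try:
            return await call_remote()
        except Exception:
            attempt += 1
            if attempt >= max_attempts:
                raise
```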
And one last point: if you already have backoff retries, it may make sense to add a circuit breaker pattern, so that when a remote resource is down the system doesn't waste time retrying over and over (and keeping threads busy doing nothing).
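A very small circuit breaker sketch to illustrate the idea: count consecutive failures and, once a threshold is reached, fail fast for a cool-down period before probing again (thresholds and names are illustrative):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.cooldown = cooldown                    # seconds to keep the circuit open
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True   # half-open: let one probe request through
        return False      # open: fail fast instead of burning retry cycles

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The worker would call `allow_request()` before each attempt and `record_success()` / `record_failure()` after it, skipping the whole retry loop while the circuit is open.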