I have a .Net core application that consists of some background tasks (hosted services) and WEB APIs (which controls and get statuses of those background tasks). Other applications (e.g. clients) communicate with this service through these WEB API endpoints. We want this service to be highly available i.e. if a service crashes then another instance should start doing the work automatically. Also, the client applications should be able to switch to the next service automatically (clients should call the APIs of the new instance, instead of the old one).
The other important requirement is that the task (computation) this service performed in the background can’t be shared between two instances. We have to make sure only one instance does this task at a given time.
What I have done up to now is, I ran two instances of the same service and use a SQL server-based distributed locking mechanism (SqlDistributedLock) to acquire a lock. If a service could acquire a lock then goes and do the operation while the other node waiting to acquire the lock. If one service crashed the next node could be able to acquire the lock. On the client-side, I used Polly based retry mechanism to switch the calling URL to the next node to find the working node.
But this design has an issue, if the node which acquired the lock loses the connectivity to the SQL server then the second service managed to acquire the lock and started doing the work while the first service is also in the middle of doing the same.
I think I need some sought of leader election (seems done it wrongly), Can anyone help me with a better solution for this kind of a problem?