The title is a bit misleading, so let me explain further.

I have a non-thread-safe DLL that I have no choice but to use as part of my back-end servers. I can't use it directly in the servers because its threading issues cause it to crash. So, I created an Akka.NET cluster of N nodes, each of which hosts a single actor. All of the API calls that originally went to that bad DLL are now routed as messages to these nodes through a round-robin group. As each node hosts only a single, single-threaded actor, I get safe access, and since I have N of them running I get parallelism, of a sort.
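For context, the shape of the setup is roughly this. It's a minimal sketch: DllWrapperActor, DllRequest, LegacyDll, the "dll-host" role, and the router/actor names are all placeholders for the real ones, and the remoting/transport config is omitted:

    using Akka.Actor;
    using Akka.Configuration;
    using Akka.Routing;

    // Stand-ins for the real message type and the non-thread-safe library.
    public sealed class DllRequest { public string Payload = ""; }
    public static class LegacyDll
    {
        public static string Execute(string payload) => payload; // placeholder work
    }

    // Each worker node hosts exactly one of these. An actor processes one
    // message at a time, so every call into the DLL is serialized.
    public class DllWrapperActor : ReceiveActor
    {
        public DllWrapperActor()
        {
            Receive<DllRequest>(req => Sender.Tell(LegacyDll.Execute(req.Payload)));
        }
    }

    public static class Example
    {
        public static void Main()
        {
            // Cluster-aware round-robin group targeting the single wrapper
            // actor on every node that carries the "dll-host" role.
            var config = ConfigurationFactory.ParseString(@"
                akka.actor.provider = ""Akka.Cluster.ClusterActorRefProvider, Akka.Cluster""
                akka.actor.deployment {
                  /dll-router {
                    router = round-robin-group
                    routees.paths = [""/user/dll-wrapper""]
                    cluster {
                      enabled = on
                      use-role = ""dll-host""
                      allow-local-routees = off
                    }
                  }
                }");
            var system = ActorSystem.Create("MySystem", config);

            // On a worker node:
            system.ActorOf(Props.Create<DllWrapperActor>(), "dll-wrapper");

            // On a client node:
            var router = system.ActorOf(
                Props.Empty.WithRouter(FromConfig.Instance), "dll-router");
            router.Tell(new DllRequest { Payload = "do-something" });
        }
    }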

In production, I have things configured with auto-down = false and the default timings on heartbeats and so on. This works perfectly. I can fire up new nodes as needed and they get added to the group, and I can remove them with Cluster.Leave and that is happy as well.
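In config terms that amounts to something like the following (auto-down-unreachable-after = off being the HOCON spelling of auto-down = false in current Akka.NET versions, if I have the key right):

    using Akka.Configuration;

    // Production: unreachable members are never automatically downed; they
    // stay marked unreachable until removed deliberately.
    var prodConfig = ConfigurationFactory.ParseString(@"
        akka.cluster {
          auto-down-unreachable-after = off
          # heartbeat intervals and failure-detector thresholds at defaults
        }");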

My issue is with debugging. In our development environment we keep a cluster of 20 nodes, each exposing a single actor as described above that wraps this DLL. We also have a set of nodes that act as seed nodes and do nothing else.
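The app joins through those seed nodes in the usual way; something like this, with a made-up system name, host names, and port:

    using Akka.Configuration;

    var joinConfig = ConfigurationFactory.ParseString(@"
        akka.cluster.seed-nodes = [
          ""akka.tcp://MySystem@seed-1:4053"",
          ""akka.tcp://MySystem@seed-2:4053""
        ]");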

When our application is run, it joins the cluster, which allows it to direct requests through the round-robin router to the nodes we keep up in the cluster. When developing, testing, and debugging the app with auto-down = false, we end up with problems whenever a test run crashes or we stop the application without going through the proper cluster-leaving logic, such as when we terminate the app with the stop button in the debugger.

Without auto-down, this leaves us with a missing member of the cluster, which causes the leader to disallow additions to the cluster. That means the next time I run the app to debug, I can't join the cluster and am stuck.

It seems that I have to have auto-down set to get debugging to work. If it is set, then when I crash my app the node is removed from the cluster 5 seconds later. When I next fire up my app, the cluster is back in a happy state and I can join just fine.
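That is, the debug configuration ends up as something like:

    using Akka.Configuration;

    // Debug: automatically down any member that stays unreachable for 5s.
    var debugConfig = ConfigurationFactory.ParseString(@"
        akka.cluster.auto-down-unreachable-after = 5s");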

The problem with this is that if I am debugging the application and pause it for any length of time, the node is almost immediately seen as unreachable and then, 5 seconds later, thrown out of the cluster. Basically, I can't debug with these settings.

So, I set failure-detector.acceptable-heartbeat-pause = 600s to give myself more time to pause the app while debugging. I will get shut down after 10 minutes, but I don't often sit in the debugger that long, so it's an acceptable trade-off. The issue with this, of course, is that when I crash the app, or stop it in the debugger, the cluster thinks the node exists for the next 10 minutes. No one tries to talk to these nodes directly, so in theory that isn't a huge issue, but I keep running into cases where the test I just ran got itself elected as role leader.

So the role leader is now dead, but the cluster doesn't know it yet, and this seems to prevent me from joining anything new to the cluster until my 10 minutes are up. When I try to leave the cluster nicely, my dead node gets stuck in the Exiting state and doesn't get removed for 10 minutes. And I don't always get notified of the removal either, forcing me to set a timeout on leaving that will cause it to give up.
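The workaround currently looks roughly like this. The 30-second give-up window on leaving is an arbitrary number I picked, not anything prescribed:

    using System;
    using System.Threading.Tasks;
    using Akka.Actor;
    using Akka.Cluster;
    using Akka.Configuration;

    // Debug: tolerate a long pause under the debugger before the node is
    // considered unreachable.
    var debugConfig = ConfigurationFactory.ParseString(@"
        akka.cluster.failure-detector.acceptable-heartbeat-pause = 600s");

    var system = ActorSystem.Create("MySystem", debugConfig);
    var cluster = Cluster.Get(system);

    // Leave nicely, but don't wait forever: the MemberRemoved callback
    // doesn't always fire while the node is stuck in Exiting.
    var removed = new TaskCompletionSource<bool>();
    cluster.RegisterOnMemberRemoved(() => removed.TrySetResult(true));
    cluster.Leave(cluster.SelfAddress);
    if (!removed.Task.Wait(TimeSpan.FromSeconds(30)))
    {
        // Give up; with the 600s pause above, removal can take up to
        // 10 minutes anyway.
        Console.WriteLine("Timed out waiting for removal.");
    }
    system.Terminate().Wait();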

There doesn't seem to be any way to say "never let me be the leader". When I have run the app with no role set for the cluster, it often seems to get itself elected as the cluster leader, causing the same problem as when the role leader is dead but not yet known to be, only at a larger scale.
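For reference, roles are just config on the worker nodes, and as far as I can tell they have no bearing on who becomes cluster leader (role name made up):

    using Akka.Configuration;

    // Worker nodes advertise the role the round-robin group targets. This
    // scopes routing and role leadership, but any member, with or without
    // a role, can still become the cluster leader.
    var workerConfig = ConfigurationFactory.ParseString(@"
        akka.cluster.roles = [""dll-host""]");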

So, I don't really see any way around this, but maybe someone has some tricks to pull it off. I want to be able to debug my cluster member without it being thrown out of the cluster, but I also don't want the cluster to think leader nodes are around when they aren't, preventing me from rejoining on my next attempt.

Any ideas?

Guy