I have this application using Akka.net cluster feature. The people who wrote the code have left the company. I am trying to understand the code and we are planning a deployment.
The cluster has 2 types of nodes
QueueServicer: supports sharding and only these nodes should participate in sharding.
LightHouse: They are just seed nodes, nothing else.
Lighthouse : 2 nodes
QueueServicer : 3 Nodes
I see one of the QueueServicer node unable to join the cluster. Both lighthouse nodes are refusing connection. It constantly tries to join and never succeeds. This has been happening for the last 5 days or so and the node is never dying also. Its CPU and memory usage is high. Also It doesn't have any queue processor actors running when filtered search through the log. It takes long hours for Garbage collection etc. I see in the log for this node, the following.
{"timestamp":"2021-09-08T22:26:59.025Z", "logger":"Akka.Event.DummyClassForStringSources", "message":Tried to associate with unreachable remote address [akka.tcp://myapp@lighthouse-1:7892]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://myapp@lighthouse-1:7892] Caused by: [System.AggregateException: One or more errors occurred. (Connection refused akka.tcp://myapp@lighthouse-1:7892) ---> Akka.Remote.Transport.InvalidAssociationException: Connection refused akka.tcp://myapp@lighthouse-1:7892 at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress) at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress) --- End of inner exception stack trace --- at System.Threading.Tasks.Task1.GetResultCore(Boolean waitCompletionNotification) at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task
1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
{"timestamp":"2021-09-08T22:26:59.025Z", "logger":"Akka.Event.DummyClassForStringSources", "message":Tried to associate with unreachable remote address [akka.tcp://myapp@lighthouse-0:7892]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://myapp@lighthouse-0:7892] Caused by: [System.AggregateException: One or more errors occurred. (Connection refused akka.tcp://myapp@lighthouse-0:7892) ---> Akka.Remote.Transport.InvalidAssociationException: Connection refused akka.tcp://myapp@lighthouse-0:7892 at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress) at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress) --- End of inner exception stack trace --- at System.Threading.Tasks.Task1.GetResultCore(Boolean waitCompletionNotification) at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task
1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
There are other "Now supervising", "Stopping" "Started" logs which I am omitting here.
Can you please verify if the HCON config is correct for split brain resolver and Sharding?
I think LightHouse/SeeNodes should not have the sharding configuration specified. I think it is a mistake. I also think, split brain resolver configuration might be wrong in LightHouse/SeedNodes and should not be specified for seed nodes.
I appreciate your help.
Here is the HOCON for QueueServicer Trimmed
akka {
loggers = ["Akka.Logger.log4net.Log4NetLogger, Akka.Logger.log4net"]
log-config-on-start = on
loglevel = "DEBUG"
actor {
provider = cluster
serializers {
hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
}
serialization-bindings {
"System.Object" = hyperion
}
}
remote {
dot-netty.tcp {
….
}
}
cluster {
seed-nodes = ["akka.tcp://myapp@lighthouse-0:7892",akka.tcp://myapp@lighthouse-1:7892"]
roles = ["QueueProcessor"]
sharding {
role = "QueueProcessor"
state-store-mode = ddata
remember-entities = true
passivate-idle-entity-after = off
}
downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
split-brain-resolver {
active-strategy = keep-majority
stable-after = 20s
keep-majority {
role = "QueueProcessor"
}
}
down-removal-margin = 20s
}
extensions = ["Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubExtensionProvider,Akka.Cluster.Tools"]
}
Here is the HOCON for Lighthouse
akka {
loggers = ["Akka.Logger.log4net.Log4NetLogger, Akka.Logger.log4net"]
log-config-on-start = on
loglevel = "DEBUG"
actor {
provider = cluster
serializers {
hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
}
serialization-bindings {
"System.Object" = hyperion
}
}
remote {
dot-netty.tcp {
…
}
}
cluster {
seed-nodes = ["akka.tcp://myapp@lighthouse-0:7892",akka.tcp://myapp@lighthouse-1:7892"]
roles = ["lighthouse"]
sharding {
role = "lighthouse"
state-store-mode = ddata
remember-entities = true
passivate-idle-entity-after = off
}
downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
split-brain-resolver {
active-strategy = keep-oldest
stable-after = 30s
keep-oldest {
down-if-alone = on
role = "lighthouse"
}
}
}
}
downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster" split-brain-resolver { active-strategy = keep-majority stable-after = 20s } – user738338 Sep 18 '21 at 19:57