I have a Service Fabric cluster running multiple applications, and each application consists of multiple (stateful and stateless) services. Two of these services (both stateful) regularly have issues where replicas of some partitions are stuck with the message:
'System.RAP' reported Warning for property 'IStatefulServiceReplica.OpenDuration'. The api IStatefulServiceReplica.Open on node XXX is stuck.
or:
'System.RA' reported Warning for property 'ReplicaOpenStatus'. Replica had multiple failures during open on XXX. The application host has crashed. For more information see: https://aka.ms/sfhealth
This is what Service Fabric Explorer looks like:
The issue is not related to the node the replica is running on, but it seems to occur more frequently on some partitions than on others.
While investigating the logs, I found a more detailed description of what is going wrong, for example:
Application: Subscriptions.exe
CoreCLR Version: 6.0.21.52210
.NET Version: 6.0.0
Description: The application requested process termination through System.Environment.FailFast(string message).
Message: GetActiveStateProvider: Stateprovider id 133038455316733741 is not present in the stateprovider-id map
Stack:
at System.Environment.FailFast(System.String)
at Microsoft.ServiceFabric.Replicator.Utility.FailFast(System.Guid, Int64, System.String)
at Microsoft.ServiceFabric.Replicator.Utility.AssertHelper(Microsoft.ServiceFabric.Replicator.ITracer, System.String, System.Object[])
at Microsoft.ServiceFabric.Replicator.Utility.Assert[[System.Int64, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](Boolean, Microsoft.ServiceFabric.Replicator.ITracer, System.String, Int64)
at Microsoft.ServiceFabric.Replicator.StateProviderMetadataManager.GetActiveStateProvider(Int64)
at Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__128.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__128, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<OnApplyAsync>d__128 ByRef)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Start[[Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__128, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<OnApplyAsync>d__128 ByRef)
at Microsoft.ServiceFabric.Replicator.DynamicStateManager.OnApplyAsync(Int64, Microsoft.ServiceFabric.Replicator.TransactionBase, System.Fabric.OperationData, System.Fabric.OperationData, Microsoft.ServiceFabric.Replicator.ApplyContext, Int64)
at Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__127.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__127, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<OnApplyAsync>d__127 ByRef)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Start[[Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__127, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<OnApplyAsync>d__127 ByRef)
at Microsoft.ServiceFabric.Replicator.DynamicStateManager.OnApplyAsync(Int64, Microsoft.ServiceFabric.Replicator.TransactionBase, System.Fabric.OperationData, System.Fabric.OperationData, Microsoft.ServiceFabric.Replicator.ApplyContext)
at Microsoft.ServiceFabric.Replicator.OperationProcessor+<ApplyCallback>d__36.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.OperationProcessor+<ApplyCallback>d__36, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ApplyCallback>d__36 ByRef)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.ServiceFabric.Replicator.OperationProcessor+<ApplyCallback>d__36, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ApplyCallback>d__36 ByRef)
at Microsoft.ServiceFabric.Replicator.OperationProcessor.ApplyCallback(Microsoft.ServiceFabric.Replicator.LogRecord)
at Microsoft.ServiceFabric.Replicator.OperationProcessor+<ProcessLoggedRecordAsync>d__32.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.OperationProcessor+<ProcessLoggedRecordAsync>d__32, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessLoggedRecordAsync>d__32 ByRef)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.ServiceFabric.Replicator.OperationProcessor+<ProcessLoggedRecordAsync>d__32, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessLoggedRecordAsync>d__32 ByRef)
at Microsoft.ServiceFabric.Replicator.OperationProcessor.ProcessLoggedRecordAsync(Microsoft.ServiceFabric.Replicator.LogRecord)
at Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessTransaction>d__21.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessTransaction>d__21, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessTransaction>d__21 ByRef)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessTransaction>d__21, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessTransaction>d__21 ByRef)
at Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher.ProcessTransaction(System.Collections.Generic.List`1<Microsoft.ServiceFabric.Replicator.TransactionLogRecord>)
at Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessSpawnedTransaction>d__20.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessSpawnedTransaction>d__20, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessSpawnedTransaction>d__20 ByRef)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessSpawnedTransaction>d__20, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessSpawnedTransaction>d__20 ByRef)
at Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher.ProcessSpawnedTransaction(System.Object)
at System.Threading.Tasks.Task`1[[System.__Canon, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].InnerInvoke()
at System.Threading.Tasks.Task+<>c.<.cctor>b__271_0(System.Object)
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
at System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread)
at System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()
However, searching the internet reveals very little information about the message "GetActiveStateProvider: Stateprovider id XXX is not present in the stateprovider-id map".
What could be the reason behind the failing replicator? And why does the state provider fail only sometimes rather than always?
Service Fabric version: 9.0.1048.9590