
I have a Service Fabric cluster running multiple applications, each of which consists of multiple (stateful and stateless) services. Two of these services (both stateful) regularly have partitions whose replicas get stuck with the message:

'System.RAP' reported Warning for property 'IStatefulServiceReplica.OpenDuration'. The api IStatefulServiceReplica.Open on node XXX is stuck.

or:

'System.RA' reported Warning for property 'ReplicaOpenStatus'. Replica had multiple failures during open on XXX. The application host has crashed. For more information see: https://aka.ms/sfhealth

This is what Service Fabric Explorer looks like:

[Screenshot: Service Fabric Explorer navigation]

[Screenshot: Replicator Status]

The issue is not related to the node the replica is running on, but it seems to occur more frequently on some partitions than on others.

While investigating the logs, I found a more detailed description of what is going wrong, for example:

Application: Subscriptions.exe
CoreCLR Version: 6.0.21.52210
.NET Version: 6.0.0
Description: The application requested process termination through System.Environment.FailFast(string message).
Message: GetActiveStateProvider: Stateprovider id 133038455316733741 is not present in the stateprovider-id map
Stack:
   at System.Environment.FailFast(System.String)
   at Microsoft.ServiceFabric.Replicator.Utility.FailFast(System.Guid, Int64, System.String)
   at Microsoft.ServiceFabric.Replicator.Utility.AssertHelper(Microsoft.ServiceFabric.Replicator.ITracer, System.String, System.Object[])
   at Microsoft.ServiceFabric.Replicator.Utility.Assert[[System.Int64, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](Boolean, Microsoft.ServiceFabric.Replicator.ITracer, System.String, Int64)
   at Microsoft.ServiceFabric.Replicator.StateProviderMetadataManager.GetActiveStateProvider(Int64)
   at Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__128.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__128, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<OnApplyAsync>d__128 ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Start[[Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__128, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<OnApplyAsync>d__128 ByRef)
   at Microsoft.ServiceFabric.Replicator.DynamicStateManager.OnApplyAsync(Int64, Microsoft.ServiceFabric.Replicator.TransactionBase, System.Fabric.OperationData, System.Fabric.OperationData, Microsoft.ServiceFabric.Replicator.ApplyContext, Int64)
   at Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__127.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__127, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<OnApplyAsync>d__127 ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Start[[Microsoft.ServiceFabric.Replicator.DynamicStateManager+<OnApplyAsync>d__127, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<OnApplyAsync>d__127 ByRef)
   at Microsoft.ServiceFabric.Replicator.DynamicStateManager.OnApplyAsync(Int64, Microsoft.ServiceFabric.Replicator.TransactionBase, System.Fabric.OperationData, System.Fabric.OperationData, Microsoft.ServiceFabric.Replicator.ApplyContext)
   at Microsoft.ServiceFabric.Replicator.OperationProcessor+<ApplyCallback>d__36.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.OperationProcessor+<ApplyCallback>d__36, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ApplyCallback>d__36 ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.ServiceFabric.Replicator.OperationProcessor+<ApplyCallback>d__36, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ApplyCallback>d__36 ByRef)
   at Microsoft.ServiceFabric.Replicator.OperationProcessor.ApplyCallback(Microsoft.ServiceFabric.Replicator.LogRecord)
   at Microsoft.ServiceFabric.Replicator.OperationProcessor+<ProcessLoggedRecordAsync>d__32.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.OperationProcessor+<ProcessLoggedRecordAsync>d__32, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessLoggedRecordAsync>d__32 ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.ServiceFabric.Replicator.OperationProcessor+<ProcessLoggedRecordAsync>d__32, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessLoggedRecordAsync>d__32 ByRef)
   at Microsoft.ServiceFabric.Replicator.OperationProcessor.ProcessLoggedRecordAsync(Microsoft.ServiceFabric.Replicator.LogRecord)
   at Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessTransaction>d__21.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessTransaction>d__21, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessTransaction>d__21 ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessTransaction>d__21, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessTransaction>d__21 ByRef)
   at Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher.ProcessTransaction(System.Collections.Generic.List`1<Microsoft.ServiceFabric.Replicator.TransactionLogRecord>)
   at Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessSpawnedTransaction>d__20.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessSpawnedTransaction>d__20, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessSpawnedTransaction>d__20 ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher+<ProcessSpawnedTransaction>d__20, Microsoft.ServiceFabric.Data.Impl, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](<ProcessSpawnedTransaction>d__20 ByRef)
   at Microsoft.ServiceFabric.Replicator.LogRecordsDispatcher.ProcessSpawnedTransaction(System.Object)
   at System.Threading.Tasks.Task`1[[System.__Canon, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].InnerInvoke()
   at System.Threading.Tasks.Task+<>c.<.cctor>b__271_0(System.Object)
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
   at System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread)
   at System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
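
To check whether the same state provider id keeps recurring across crashes, or whether each crash names a different id, the FailFast messages could be tallied with a small script. The following is only a rough sketch: it assumes the crash logs are available as plain text and that the message always has exactly the format shown above. The function name `count_missing_provider_ids` is made up for illustration and is not part of any Service Fabric tooling.

```python
import re
from collections import Counter

# Matches the FailFast message format seen in the crash log above.
# Assumption: the wording is identical in every crash entry.
MISSING_PROVIDER = re.compile(
    r"GetActiveStateProvider: Stateprovider id (\d+) is not present"
)


def count_missing_provider_ids(log_text: str) -> Counter:
    """Count how often each state provider id appears in FailFast messages."""
    return Counter(int(match) for match in MISSING_PROVIDER.findall(log_text))


if __name__ == "__main__":
    # Sample input copied from the log entry in this question.
    sample = (
        "Message: GetActiveStateProvider: Stateprovider id 133038455316733741 "
        "is not present in the stateprovider-id map\n"
    )
    for provider_id, hits in count_missing_provider_ids(sample).most_common():
        print(provider_id, hits)
```

Running this over the logs of each affected partition would show whether a handful of ids account for most of the crashes.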

However, searching the internet turns up very little about "GetActiveStateProvider: Stateprovider id XXX is not present in the stateprovider-id map". What could cause the replicator to fail like this? And why does the state provider lookup fail only sometimes rather than always?

Service Fabric version: 9.0.1048.9590

crates_barrels
