We have a Service Fabric StatefulService that is running nicely. It consumes messages, processes them, and has two IReliableStates for storing some data from each message. It processes about 500 messages per minute per replica.
For each message that comes into the MessageProcessor, we create a new ITransaction using IReliableStateManager, wrapped in a using block, and pass that transaction to the OuterMessageHandler. After the message has been processed, we call ITransaction.CommitAsync, or ITransaction.Abort if something has failed.
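A simplified sketch of that flow, to make the transaction lifetime concrete. This is not our exact code: the ProcessAsync method name, the IOuterMessageHandler interface, and the empty Envelope class are stand-ins invented for illustration. It assumes the Microsoft.ServiceFabric.Data package is referenced.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;

// Hypothetical stand-ins for our own types; the real ones carry more detail.
public class Envelope { }

public interface IOuterMessageHandler
{
    Task Handle(ITransaction tx, params Envelope[] messages);
}

public class MessageProcessor
{
    private readonly IReliableStateManager _stateManager;
    private readonly IOuterMessageHandler _outerHandler;

    public MessageProcessor(IReliableStateManager stateManager, IOuterMessageHandler outerHandler)
    {
        _stateManager = stateManager;
        _outerHandler = outerHandler;
    }

    // Hypothetical entry point: one transaction per incoming message.
    public async Task ProcessAsync(Envelope message)
    {
        // The using block disposes the transaction once we are done with it.
        using (ITransaction tx = _stateManager.CreateTransaction())
        {
            try
            {
                await _outerHandler.Handle(tx, message);
            }
            catch (Exception)
            {
                // Something failed while handling the message: roll back.
                tx.Abort();
                throw;
            }

            // Handling succeeded: persist the changes made under this transaction.
            await tx.CommitAsync();
        }
    }
}
```

The point of the sketch is only that the transaction is created, committed or aborted, and disposed in MessageProcessor, never in the handlers.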
OuterMessageHandler looks like this:
public async Task Handle(ITransaction tx, params Envelope[] messages)
{
    foreach (var msg in messages)
    {
        using (var scope = _scope.BeginLifetimeScope())
        {
            var contextProvider = scope.Resolve<MessageContextProvider>();
            contextProvider.Set(tx);
            await innerHandler.Handle(msg);
        }
    }
}
MessageContextProvider is simply the following:
internal class MessageContextProvider
{
    private ITransaction _tx;

    public void Set(ITransaction tx)
    {
        _tx = tx;
    }

    public ITransaction GetTransaction()
    {
        if (_tx == null)
            throw new Exception("ITransaction has not been set");
        return _tx;
    }
}
This is registered with Autofac:
builder
    .Register(c =>
    {
        var context = c.Resolve<MessageContextProvider>();
        return context.GetTransaction();
    })
    .As<ITransaction>()
    .ExternallyOwned();

builder
    .RegisterType<MessageContextProvider>()
    .AsSelf()
    .InstancePerLifetimeScope();
MessageContextProvider exists simply to let us inject ITransaction into every InnerMessageHandler as if it were a normal dependency, while still using the same transaction for the whole message. ITransaction is marked as ExternallyOwned so that the lifetime scope in OuterMessageHandler does not dispose of it before we commit within MessageProcessor.
InnerMessageHandlers are just handlers that execute our business logic.
IReliableState has 2 methods: SomeType Find(long id) and Update(long id, SomeType someType).
The Find method looks like this:
var snapshotHandler = await _stateManager.GetOrAddAsync<IReliableDictionary<long, SomeType>>(SomeTypeKey);
var snapshot = await snapshotHandler.GetOrAddAsync(
    _transaction,
    Id,
    new SomeType());
return snapshot;
Update looks like this:
var snapshotHandler = await _stateManager.GetOrAddAsync<IReliableDictionary<long, SomeType>>(SomeTypeKey);
await snapshotHandler.SetAsync(_transaction, Id, snapshot);
When we throw a lot of requests at the service, all replicas initially stay well below 1% CPU usage. After about an hour, one of the replicas climbs to around 30-35%. When we stop the tests from hitting the service (i.e., the service is now sitting idle), CPU usage on that replica still stays at 30-35%. It would be fine if we had spikes that went back down, but continuous high CPU usage is the problem.
As part of our investigation, we replaced the 2 IReliableStates with 2 in-memory ConcurrentDictionary instances. This made the issue go away: we could run it for hours and nothing rose above 2% CPU usage. This is obviously not a solution, as we need the internal state persisted for resilience.
We have used PerfView and dotTrace to see what is happening, but neither has turned up much useful information.
At this point I believe it has something to do with how we use either IReliableDictionary or ITransaction. Has anyone had similar issues? Could anyone shed some light on what we could possibly be doing wrong?
Edit
One of the ReliableState repositories (let's call it state2) implemented the Find method slightly differently. It looked like this:
var snapshotHandler = await _stateManager.GetOrAddAsync<IReliableDictionary<long, SomeType2>>(SomeTypeKey2);
var snapshot = await snapshotHandler.TryGetValueAsync(
    _transaction,
    Id);
if (snapshot.HasValue)
    return snapshot.Value.Clone(); // Clone is a deep copy
var stateFromSomeApi = await someApi.GetStartState(id);
await snapshotHandler.SetAsync(_transaction, Id, stateFromSomeApi);
return stateFromSomeApi;
We changed it to do the following:
public async Task<SomeType2> Find(long id)
{
    var snapshotHandler = await _stateManager.GetOrAddAsync<IReliableDictionary<long, SomeType2>>(
        SomeTypeKey2);
    var snapshot = await snapshotHandler.GetOrAddAsync(
        _transaction,
        id,
        await GetStartState(id));
    return snapshot.Clone();
}

private async Task<SomeType2> GetStartState(long id)
{
    return await _someApi.GetStartState(id);
}
Since we changed state2 to use GetOrAddAsync, it has been working perfectly fine. Why would calling TryGetValueAsync followed by SetAsync, instead of GetOrAddAsync, leave threads hanging around like this? We have tested the new version with nearly double the production load, and CPU stays well below 5% on each primary replica.