Service Fabric StatefulService CPU usage keeps growing

Question

We have a Service Fabric StatefulService that is running nicely. It consumes messages, processes them, and uses two IReliableStates to store some data from each message. It processes about 500 messages per minute per replica.

For each message that comes into the MessageProcessor, we create a new ITransaction using IReliableStateManager, wrap it in a using block, and pass that transaction to the OuterMessageHandler. After the message has been processed, we call ITransaction.CommitAsync, or ITransaction.Abort if something has failed.
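For illustration, the flow described above can be sketched roughly as follows. This is an assumed shape, not the actual MessageProcessor: the IOuterMessageHandler interface, the ProcessAsync method name, the field names, and the choice to rethrow after Abort are all hypothetical. Only CreateTransaction, CommitAsync and Abort are the real Service Fabric calls.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;

// Hypothetical stand-in for the message type named in the question.
public class Envelope { }

// Hypothetical interface matching the Handle signature shown in the question.
public interface IOuterMessageHandler
{
    Task Handle(ITransaction tx, params Envelope[] messages);
}

public class MessageProcessor
{
    private readonly IReliableStateManager _stateManager;
    private readonly IOuterMessageHandler _outerHandler;

    public MessageProcessor(IReliableStateManager stateManager, IOuterMessageHandler outerHandler)
    {
        _stateManager = stateManager;
        _outerHandler = outerHandler;
    }

    // Assumed entry point: one transaction per incoming message.
    public async Task ProcessAsync(Envelope message)
    {
        // The using block disposes the transaction once processing is done.
        using (ITransaction tx = _stateManager.CreateTransaction())
        {
            try
            {
                await _outerHandler.Handle(tx, message);

                // Persist and replicate everything written under this transaction.
                await tx.CommitAsync();
            }
            catch (Exception)
            {
                // Something failed: discard the writes made under this transaction.
                tx.Abort();
                throw; // assumption: the caller decides how to handle the failure
            }
        }
    }
}
```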

OuterMessageHandler looks like this:

    public async Task Handle(ITransaction tx, params Envelope[] messages)
    {
        foreach (var msg in messages)
        {
            using (var scope = _scope.BeginLifetimeScope())
            {
                var contextProvider = scope.Resolve<MessageContextProvider>();

                contextProvider.Set(tx);

                await innerHandler.Handle(msg);
            }
        }
    }

MessageContextProvider is simply the below:

    internal class MessageContextProvider
    {
        private ITransaction _tx;

        public void Set(ITransaction tx)
        {
            _tx = tx;
        }

        public ITransaction GetTransaction()
        {
            if (_tx == null)
                throw new Exception("ITransaction has not been set");

            return _tx;
        }
    }

This is registered with Autofac:

        builder
            .Register(c =>
            {
                var context = c.Resolve<MessageContextProvider>();

                return context.GetTransaction();
            })
            .As<ITransaction>()
            .ExternallyOwned();

        builder
            .RegisterType<MessageContextProvider>()
            .AsSelf()
            .InstancePerLifetimeScope();

MessageContextProvider exists simply to let us use ITransaction in all of the InnerMessageHandlers as if it were a normal dependency, while still using the same transaction throughout each message. ITransaction is marked as ExternallyOwned so that the OuterMessageHandler does not dispose of it before we commit within MessageProcessor.

InnerMessageHandlers are just handlers to execute our business logic.

Each IReliableState has two methods: SomeType Find(long id) and Update(long id, SomeType someType).

Find method looks like the below:

        var snapshotHandler = await _stateManager.GetOrAddAsync<IReliableDictionary<long, SomeType>>(SomeTypeKey);

        var snapshot = await snapshotHandler.GetOrAddAsync(
            _transaction,
            Id,
            new SomeType());

        return snapshot;

Update looks like this:

        var snapshotHandler = await _stateManager.GetOrAddAsync<IReliableDictionary<long, SomeType>>(SomeTypeKey);

        await snapshotHandler.SetAsync(_transaction, Id, snapshot);

When we throw a lot of requests at the service, all replicas stay well below 1% CPU usage. After about an hour, one of the replicas gets to around 30%-35%. When we stop the tests from hitting the service (i.e., the service is now sitting idle), CPU usage still stays between 30-35%. It would be fine if we had spikes and went back down, but continuous high CPU usage is the problem.

As part of our investigation, we replaced the two IReliableStates with two in-memory ConcurrentDictionaries. This solved the issue: we could run the service for hours and nothing rose above 2% CPU usage. This is obviously not a solution, as we need the internal state persisted for resilience.

We have used PerfView and dotTrace to see what is happening, but not much useful information has come up.

At this point I believe it has something to do with how we use either IReliableDictionary or ITransaction. Has anyone had similar issues? Could anyone shed some light on what we could be doing wrong?

Edit: One of the ReliableState repositories (let's call it state2) implemented the Find method slightly differently. It looked like this:

        var snapshotHandler = await _stateManager.GetOrAddAsync<IReliableDictionary<long, SomeType2>>(SomeTypeKey2);

        var snapshot = await snapshotHandler.TryGetValueAsync(
            _transaction,
            Id);

        if (snapshot.HasValue)
            return snapshot.Value.Clone(); // Clone is a deep copy

        var stateFromSomeApi = await someApi.GetStartState(id);

        await snapshotHandler.SetValue(_transaction, Id, stateFromSomeApi);

        return stateFromSomeApi;

We changed it to do the following:

    public async Task<SomeType2> Find(long id)
    {
        var snapshotHandler = await _stateManager.GetOrAddAsync<IReliableDictionary<long, SomeType2>>(
            SomeTypeKey2);

        var snapshot = await snapshotHandler.GetOrAddAsync(
            _transaction,
            id,
            await GetStartState(id));

        return snapshot.Clone();
    }

    private async Task<SomeType2> GetStartState(long id)
    {
        return await _someApi.GetStartState(id);
    }

We changed state2 to use GetOrAddAsync and it now works perfectly fine. We have tested it with nearly double the production load and CPU stays well below 5% on each primary replica. Why would using TryGetValueAsync and SetValue instead of GetOrAddAsync keep threads hanging around so much?

`When we stop the tests from hitting the service (ie: service is now sitting idle) CPU usage still stays between 30-35%.` What is RAM usage at this time? Are you able to profile to establish what the 30-35% CPU usage is actually doing? — mjwills, Aug 25 '18 at 11:46
`We changed state2 to do a GetOrAddAsync and it is working perfectly fine.` You will need to show us that code so we can compare directly the working vs non-working code. — mjwills, Aug 25 '18 at 13:48
The RAM usage actually stays constant. The primary stays below 300 MB and the active secondaries at around 60 MB. — Cosie SicLovan, Aug 27 '18 at 11:49
`We changed state2 to do a GetOrAddAsync and it is working perfectly fine. You will need to show us that code so we can compare directly the working vs non-working code.` I updated the submission with what the code looks like now. — Cosie SicLovan, Aug 27 '18 at 11:58
Did you use Timeline profiling mode in dotTrace? Is there an intensive GC when you see high CPU usage? — KonKat, Aug 28 '18 at 13:39
