2

We have bought a lot of licenses, done a lot of testing with a lot of promissing results and are on the brink of our first release :).

But now we have hit a big bump on the road, meaning we might have to abandon the bus, if we cant explain and fix it :/.

Our Distributor suddenly have control error messages like the one below:

<?xml version="1.0"?>
<ArrayOfHeaderInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <HeaderInfo>
        <Key>NServiceBus.ControlMessage</Key>
        <Value>True</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.Distributor.WorkerCapacityAvailable</Key>
        <Value>20</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.Distributor.WorkerStarting</Key>
        <Value>True</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>CorrId</Key>
        <Value>58dd98f5-9ac0-44fb-8604-3a0f06787a35\295075</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.Reason</Key>
        <Value>ProcessingFailed</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.ExceptionType</Key>
        <Value>System.InvalidOperationException</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.HelpLink</Key>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.Message</Key>
        <Value>Property ResponseQueue was not retrieved when receiving the message. Ensure that the PropertyFilter is set correctly.</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.Source</Key>
        <Value>NServiceBus.Core</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.StackTrace</Key>
        <Value>   at NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.ProcessMessage(TransportMessage m) in c:\BuildAgent\work\nsb.master_6\src\impl\unicast\transport\NServiceBus.Unicast.Transport.Transactional\TransactionalTransport.cs:line 312
   at NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.ReceiveMessage() in c:\BuildAgent\work\nsb.master_6\src\impl\unicast\transport\NServiceBus.Unicast.Transport.Transactional\TransactionalTransport.cs:line 275
   at NServiceBus.Utils.TransactionWrapper.RunInTransaction(Action callback, IsolationLevel isolationLevel, TimeSpan transactionTimeout) in c:\BuildAgent\work\nsb.master_6\src\utils\TransactionWrapper.cs:line 32
   at NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.Process() in c:\BuildAgent\work\nsb.master_6\src\impl\unicast\transport\NServiceBus.Unicast.Transport.Transactional\TransactionalTransport.cs:line 220</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.OriginalId</Key>
        <Value>58dd98f5-9ac0-44fb-8604-3a0f06787a35\295075</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.FailedQ</Key>
        <Value>someservice.processId.distributor.control@testservices01</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.TimeOfFailure</Key>
        <Value>2013-04-30 10:07:40:750707 Z</Value>
    </HeaderInfo>
</ArrayOfHeaderInfo>

Google tells us that this might relate to some threading issues and maybe even the way that NSB is implemented using peek/receive.

The above exception relates to this file at GitHub: https://github.com/NServiceBus/NServiceBus/blob/master/src/impl/unicast/transport/NServiceBus.Unicast.Transport.Transactional/TransactionalTransport.cs

Details about our implementation:

We use a custom IManageUnitsOfWork, due to some legacy issue which means no DTC against the DB yet. I do not think that this could be the cause, but i think its worth mentioning. This is the implemenation:

public class ManagedUnitOfWorkWithDtcSuppression : IManageUnitsOfWork
{
    private readonly IContainer _container;
    private IUnitOfWork _unitOfWork;
    private readonly TransactionScope _scope;

    public ManagedUnitOfWorkWithDtcSuppression()
    {
        _scope = new TransactionScope(TransactionScopeOption.Suppress);
        _container = ObjectFactory.GetInstance<IContainer>();
    }

    public void Begin()
    {
        _unitOfWork = _container.GetInstance<IUnitOfWork>();
    }

    public void End(Exception exception = null)
    {
        if (exception == null)
        {
            _unitOfWork.Commit();
        }

        _unitOfWork.Dispose();
        _scope.Complete();
        _scope.Dispose();
    }
}

Also we have a special setup where we run 4 identical AppDomains inside 1 service running, which means that when we start the service as a distributor, there are actually 4 distributors running. But these are pr. definition completely isolated from one another. The IBus is unique for each AppDomain, this has been tested.

Our Distributor configuration looks like this:

        return NServiceBus.Configure.With()
            .DefineEndpointName(queuePrefix)
            .Log4Net(ObjectFactory.GetInstance<IServiceBusLog>().Build())
            .StructureMapBuilder()
            .JsonSerializer()
            .AsMasterNode()
            .RunDistributorWithNoWorkerOnItsEndpoint()
            .MsmqTransport()
            .IsTransactional(true)
            .DisableTimeoutManager()
            .DisableSecondLevelRetries()
            .UnicastBus()
            .CreateBus()
            .Start(() => NServiceBus.Configure.Instance.ForInstallationOn<NServiceBus.Installation.Environments.Windows>().Install());

Question:

What is going on here?

Are we screwing with NSB because we are using DTC supression, is there a MSMQ bug or is there a NSB bug?

Christian Mikkelsen
  • 1,661
  • 2
  • 19
  • 44
  • Which version of NServiceBus is this on? – Udi Dahan Apr 30 '13 at 13:15
  • Version 3.3.5 of the bus. – Christian Mikkelsen Apr 30 '13 at 13:18
  • Units of work doesn't come into play for the control messages going to the distributor. Something seems to be wrong with either the control queue or the control message coming from the worker. Can you start one of the workers and use Queue Explorer to dump the control message to file so that we can take a look? – Andreas Öhlund Apr 30 '13 at 20:35
  • Of course :). I still have the messages from the last time this happened and will send them as I do not know how to recreate the incident yet. Where do you want them, at the official NSB support email, dropbox or? – Christian Mikkelsen Apr 30 '13 at 21:29
  • One thing that seems odd is the fact that the 3 control error msg's I have all have identical first parts of the CorrId (58dd98f5-9ac0-44fb-8604-3a0f06787a35/XXXXXX), allthough they are running in seperate AppDomains, within the same service. We are creating/starting the bus using StructureMap and each bus have a unique Hashcode in their app domain. Could this be the source of the issue? Is there a seperate "singleton" distributor process behind the curtains which might cause issues with our setup (wild guess)? – Christian Mikkelsen Apr 30 '13 at 21:41
  • Heres a dropbox link: https://www.dropbox.com/sh/m1yu0goswensnhn/IadEuRzQL9 – Christian Mikkelsen May 01 '13 at 05:39
  • Was there ever an answer to this? – Andreas Jan 06 '14 at 14:15
  • Well sort of. We don't use the distributor anymore due to this issue among other things and haven't seen this error since, at all. Also we have some major workloads which required us to implement an upper threshold on the number of messages on the queues, which prevents further processing until the queue is below the threshold again. This means that any errors caused by stress are kept to a minimum and by doing so we might have "resolved" the issue descriped above. So unfortunately I cannot give an answer to how to solve this issue. – Christian Mikkelsen Jan 07 '14 at 11:10

1 Answers1

0

Let me jump in with an answer a year late! :) I'm pretty sure you're seeing https://github.com/Particular/NServiceBus/pull/2250. Basically Microsoft made changes to the implementation of MessageQueue between .NET 3.5 and .NET 4, making the NSB code non-thread safe. This was fixed in https://github.com/Particular/NServiceBus/releases/tag/3.3.10

janovesk
  • 1,138
  • 7
  • 14