
I am working on a proof of concept implementation of NServiceBus v4.x for work.

Right now I have two subscribers and a single publisher.

The publisher can publish over 500 messages per second. It runs great.

Subscriber A runs without distributors/workers. It is a single process.

Subscriber B runs with a single distributor powering N number of workers.

In my test I hit an endpoint that creates and publishes 100,000 messages. I do this publish with the subscribers offline.
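For reference, a minimal sketch of what such a publish loop might look like in NServiceBus 4.x. The FastEvent type, its Sequence property, and the endpoint class are illustrative placeholders, not the actual test code:

    using NServiceBus;

    // Hypothetical publish loop; IBus is property-injected by NServiceBus.
    public class LoadTestPublisher
    {
        public IBus Bus { get; set; }

        public void PublishBatch()
        {
            for (var i = 0; i < 100000; i++)
            {
                // By default each Publish is its own transactional MSMQ write.
                Bus.Publish(new FastEvent { Sequence = i });
            }
        }
    }

    // Placeholder event; shown with the IEvent marker, though plain POCOs
    // registered via unobtrusive conventions also work in 4.x.
    public class FastEvent : IEvent
    {
        public int Sequence { get; set; }
    }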

Subscriber A processes a steady 100 messages per second. Subscriber B with 2+ workers (same result with 2, 3, or 4) struggles to top 50 messages per second gross across all workers.

It seems in my scenario that the workers (which I ramped up to 40 threads per worker) are waiting around for the distributor to give them work.
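For context, in NServiceBus 4.x the per-endpoint thread count is usually raised via the TransportConfig section in app.config; a sketch of the kind of setting described above, with illustrative values and the standard section registration assumed:

    <!-- app.config of the worker endpoint (illustrative values) -->
    <configuration>
      <configSections>
        <section name="TransportConfig"
                 type="NServiceBus.Config.TransportConfig, NServiceBus.Core" />
      </configSections>
      <!-- MaximumConcurrencyLevel: number of processing threads;
           MaximumMessageThroughputPerSecond="0" means unthrottled -->
      <TransportConfig MaximumConcurrencyLevel="40"
                       MaximumMessageThroughputPerSecond="0" />
    </configuration>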

Am I missing something that could be causing the distributor to be throttled? All buses are running an unlimited Dev license.

System information: Intel Core i5 M520 @ 2.40 GHz, 8 GB of RAM, SSD hard drive.

UPDATE 08/06/2013: I finished deploying the system to a set of servers. I am experiencing the same results. Every worker server I add decreases the subscriber's performance.

Subscriber B has a distributor on one server and two additional servers for workers. With Subscriber B and one server with an active worker, I am seeing ~80 messages/events per second. Adding another worker on an additional physical machine decreases that to ~50 messages per second. Also, these are "dummy messages": no logic actually happens in the handlers other than logging the message through log4net. Turning off the logging doesn't increase performance.
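For reference, the kind of "dummy" handler described would look roughly like this in NServiceBus 4.x; the type names are placeholders and the log4net usage is an assumption based on the description above:

    using log4net;
    using NServiceBus;

    // Placeholder handler matching the description: no logic beyond a log line.
    public class FastEventHandler : IHandleMessages<FastEvent>
    {
        private static readonly ILog Log =
            LogManager.GetLogger(typeof(FastEventHandler));

        public void Handle(FastEvent message)
        {
            Log.InfoFormat("Received FastEvent {0}", message.Sequence);
        }
    }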

Suggestions?

0xElGato

3 Answers


If you're scaling out with NServiceBus master/worker nodes on one server, then trying to measure performance is meaningless. One process with multiple threads will always do better than a distributor and multiple worker nodes on the same machine because the distributor will become a bottleneck while everything is competing for the same compute resources.

If the workers are moved to separate servers, it becomes a completely different story. The distributor is very efficient at doling out messages if that's the only thing happening on the server.

Give it a try with multiple servers and see what happens.

David Boike
  • Going to finish up a few additions to the POC (Sagas, for example) and then I'll test it with servers from our IT department. – 0xElGato Jul 24 '13 at 20:01
  • Still setting up the server environments. Almost done. Should have an update by early next week. – 0xElGato Jul 31 '13 at 16:52
  • I have completed setting up the servers and updated my original post with details. – 0xElGato Aug 06 '13 at 14:30
  • You should monitor network bandwidth, disk I/O, disk queue length, and CPU usage on all servers. What kind of network throughput do you get between servers? Are they on a private switch? In a decent network you should be able to get at least above 100 MB/s. Do you use write caching in Windows? Do you have drive encryption enabled? If you process everything with transactions then you are getting lock issues due to the number of workers you create. Can you share the code of the test that you are performing? – Ramon Smits Aug 06 '13 at 16:00
  • How do you write those 100,000 messages? In separate transactions or one transaction? Writing 100,000 separate transactional messages already takes quite some time (see the sketch after these comments). – Ramon Smits Aug 06 '13 at 16:02
  • The handlers do not do any real work, correct. I will get the rest of the questions answered as soon as I can. – 0xElGato Aug 06 '13 at 16:16
  • 1 - In my testing during monitoring, none of the resources ever come close to being maxed out. Does this indicate I need to increase the threads at the distributor and/or worker? 2 - Network throughput: private VM environment. In my proof it is a 100 Mbit NIC on each VM, but in our production environment they all run 10 GbE. 3 - During heavy load the network never got above 4-5 MB/s. 4 - Do you use write caching in Windows? No. 5 - Do you have drive encryption enabled? No. – 0xElGato Aug 07 '13 at 13:38
  • Thanks everyone for the help. I am starting to get some positive progress. I have been tweaking the configs of the distributor and worker and have been able to see an increase in throughput of almost double my original findings. I'm going to keep working on finding the point where I stop gaining and the box starts maxing out or bottlenecking. – 0xElGato Aug 07 '13 at 18:02
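To illustrate the separate-versus-single-transaction distinction Ramon raises above, a hedged sketch: wrapping the publishes in one TransactionScope commits them as a single unit, whereas the default is one transaction per Publish. The FastEvent type is the placeholder from the question; a batch of 100,000 in one transaction may well exceed MSDTC/MSMQ limits in practice:

    using System.Transactions;
    using NServiceBus;

    public static class BatchPublisher
    {
        // Publishes the whole batch inside a single ambient transaction
        // instead of the default one-transaction-per-Publish.
        public static void PublishInOneTransaction(IBus bus, int count)
        {
            using (var scope = new TransactionScope())
            {
                for (var i = 0; i < count; i++)
                {
                    bus.Publish(new FastEvent { Sequence = i });
                }
                scope.Complete();
            }
        }
    }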

Rather than having a dummy handler that does nothing, can you simulate actual processing by adding in some sleep time, say 5 seconds, and then compare the results of a plain subscriber versus going through the distributor?
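A minimal sketch of such a handler, assuming the SlowEvent type from the question and NServiceBus 4.x's synchronous handler signature:

    using System.Threading;
    using NServiceBus;

    // Simulates real work in the handler, as suggested above.
    public class SlowEventHandler : IHandleMessages<SlowEvent>
    {
        public void Handle(SlowEvent message)
        {
            Thread.Sleep(5000); // stand-in for ~5 seconds of actual processing
        }
    }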

Scaling out (with or without a distributor) is only useful where the work being done on a single machine takes time and therefore more computing resources help. To help with this, monitor the CriticalTime performance counter on the endpoint, and when you have the need, add in the distributor. Scaling out using the distributor when needed is made easy by not having to change code: just start the same endpoint in the distributor and worker profiles.
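For illustration, the "no code change" part looks roughly like this with the NServiceBus host: the same endpoint binaries are started under different runtime profiles. The MSMQ-prefixed profile names shown are the NServiceBus 4.x ones; verify them against your exact version:

    REM On the distributor machine:
    NServiceBus.Host.exe NServiceBus.MSMQDistributor

    REM On each worker machine (same endpoint binaries):
    NServiceBus.Host.exe NServiceBus.MSMQWorker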

Indu Alagarsamy
  • I have already done that. I have two messages "FastEvent" and "SlowEvent". The slow event has a Thread.Sleep of 2 seconds. I will do a scaled out test with this. – 0xElGato Aug 07 '13 at 14:31
  • Sorry, the handler for SlowEvent has a Thread.Sleep. The messages are just POCO. – 0xElGato Aug 07 '13 at 14:53
  • Ok. More importantly, think about the distributor as a means to scale out after/close to exhausting the computing resources on the box. Cheers. – Indu Alagarsamy Aug 07 '13 at 17:21
  • I was able to almost double the throughput of my setup with a single Distributor / Worker by increasing the threads on the distributor and the worker. Neither box is still taxed. I'm going to keep going and see where the point of diminishing returns starts. Thanks. – 0xElGato Aug 07 '13 at 18:00

The whole chain is transactional. You are paying heavily for this. Spreading the workload across machines will not really increase performance when you do not have very fast disk storage with write-through caching to speed up transactional writes.

When you have your POC scaled out to several servers, try marking the messages as 'Express', which skips the transactional writes to the queue, and disable MSDTC on the bus instance to see what kind of performance is possible without transactions. This is not really usable for production unless you know where transactions are not mandatory, or what is possible with an architecture that does not require the DTC.
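A sketch of the two changes described, assuming NServiceBus 4.x APIs: the [Express] attribute skips durable, transactional writes for that message type, and Configure.Transactions.Disable() turns off transactional handling (and with it the DTC). Here it hangs off an INeedInitialization implementation so the rest of the endpoint configuration stays intact; type names are placeholders:

    using NServiceBus;

    // Express: the message is not written durably/transactionally to the queue.
    [Express]
    public class FastEvent : IEvent
    {
        public int Sequence { get; set; }
    }

    // Runs during endpoint initialization; disabling transactions
    // also takes MSDTC out of the pipeline.
    public class DisableTransactions : INeedInitialization
    {
        public void Init()
        {
            Configure.Transactions.Disable();
        }
    }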

Ramon Smits
  • I will be testing this in a simulated production environment across three servers on Friday. I'll update this thread when I find out more. I'll also try what you said to see how that impacts it. Thanks – 0xElGato Jul 25 '13 at 13:14