
My question is to settle an argument with my co-workers on C++ vs C#.

We have implemented a server that receives a large amount of UDP streams. This server was developed in C++ using asynchronous sockets and overlapped I/O with completion ports. We use 5 completion ports with 5 threads. This server can easily handle 500 Mbps of throughput on a gigabit network without any packet loss or errors (we didn't push our tests beyond 500 Mbps).

We have tried to re-implement the same kind of server in C# and we have not been able to reach the same incoming throughput. We are doing asynchronous receives using the ReceiveAsync method and a pool of SocketAsyncEventArgs to avoid the overhead of creating a new object for every receive call. Each SocketAsyncEventArgs has a buffer set on it so we do not need to allocate memory for every receive. The pool is very, very large so we can queue more than 100 receive requests. This server is unable to handle an incoming throughput of more than 240 Mbps. Above that limit, we lose some packets in our UDP streams.
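For reference, here is a minimal sketch of the receive path described above (class and member names are illustrative, not the actual code): a pool of SocketAsyncEventArgs is pre-allocated with fixed buffers, many receives are kept outstanding, and each args object is re-posted as soon as its completion fires.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Minimal sketch of the SocketAsyncEventArgs receive loop described above.
class UdpReceiver
{
    const int BufferSize = 65535;     // max UDP datagram size
    const int PendingReceives = 100;  // outstanding receives, as in the question
    readonly Socket _socket;

    public UdpReceiver(int port)
    {
        _socket = new Socket(AddressFamily.InterNetwork,
                             SocketType.Dgram, ProtocolType.Udp);
        _socket.Bind(new IPEndPoint(IPAddress.Any, port));
    }

    public void Start()
    {
        // Pre-allocate the pool: one SocketAsyncEventArgs with its own fixed
        // buffer per outstanding receive, so the hot path allocates nothing.
        for (int i = 0; i < PendingReceives; i++)
        {
            var args = new SocketAsyncEventArgs();
            args.SetBuffer(new byte[BufferSize], 0, BufferSize);
            args.Completed += OnReceiveCompleted;
            PostReceive(args);
        }
    }

    void PostReceive(SocketAsyncEventArgs args)
    {
        // ReceiveAsync returns false when the operation completed
        // synchronously; the Completed event is NOT raised in that case,
        // so loop here instead of recursing.
        while (!_socket.ReceiveAsync(args))
            Process(args);
    }

    void OnReceiveCompleted(object sender, SocketAsyncEventArgs args)
    {
        Process(args);
        PostReceive(args); // re-queue the same args, reusing its buffer
    }

    void Process(SocketAsyncEventArgs args)
    {
        if (args.SocketError != SocketError.Success) return;
        // args.Buffer[args.Offset .. args.Offset + args.BytesTransferred)
        // holds the datagram; hand it off to a worker thread here.
    }
}
```

The `while (!ReceiveAsync(...))` loop matters under load: synchronous completions are common when data is already queued, and recursing on them instead of looping can grow the stack.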

My question is this: should I expect the same performance using C++ sockets and C# sockets? My opinion is that the performance should be the same if memory is managed correctly in .NET.

Side question: would anybody know a good article/reference explaining how .NET sockets use I/O completion ports under the hood?

John Saunders
mdarsigny
    +1 Very nicely asked question – Chris Laplante Dec 11 '11 at 16:44
    possible duplicate of http://stackoverflow.com/questions/2773488/socket-performance-c-or-c-sharp – Glory Raj Dec 11 '11 at 16:44
  • The plumbing is the same. Focus on the code that processes the data instead. – Hans Passant Dec 11 '11 at 17:03
    C++ doesn't have sockets. Are you talking about Posix Sockets, or another networking library? – Skyler Saleh Dec 11 '11 at 17:04
  • @RTS Given the nature of this article, that'll be Win32 sockets. Win32 includes most (if not all) of the POSIX APIs, and they can also be used as handles with other Win32 APIs, including all the Win32 asynchronous operation support. – Richard Dec 11 '11 at 17:26
  • C# doesn't have sockets. You're talking about .NET sockets. – John Saunders Dec 11 '11 at 17:51
    @RTS, JohnSaunders: Why the nit-picking on details regarding types of sockets? Everyone can understand that it's win32 sockets and .NET sockets since he talks about C# and IO Completion ports.... – jgauffin Dec 11 '11 at 17:55
  • @mdarsigny: How do you allocate the buffers for the SocketAsyncEventArgs? And when? How do you allocate the buffers for the sends? – jgauffin Dec 11 '11 at 17:57
    @jgauffin: precision is a good thing – John Saunders Dec 11 '11 at 18:28
    @jgauffin It wasn't clear to me that he was referring to win32 sockets. – Skyler Saleh Dec 11 '11 at 20:33
  • @RTS: By "C++ sockets", I meant Windows Socket (winsock). I thought my explanation was long enough so I kept the details at a bare minimum... – mdarsigny Dec 13 '11 at 00:05
  • @jgauffin: Prior to any socket operation, we create several SocketAsyncEventArgs and, for each of them, we allocate a memory buffer and set it on the SAEvtArgs through the SetBuffer() function. Then we call ReceiveAsync 100 times, with a different SAEvtArgs each time. When a read succeeds, we redo a ReceiveAsync using the same SAEvtArgs. We currently don't do anything with the received data; we just want to see what the max incoming throughput is that we can get. – mdarsigny Dec 13 '11 at 00:08
  • @RTS,@JohnSaunders,@jgauffin: Concerning precision: C++ Socket ==> using Windows socket. Receive is done using WSARecv and WSAOVERLAPPED. Socket is attached to an IO Completion port. (similar to http://www.codeproject.com/KB/IP/winsockiocp.aspx) C# Socket ==> using System.Net.Sockets.Socket. We are using the ReceiveAsync() method with a pool of SocketAsyncEventArgs & Buffer. The goal is to prevent allocation/reallocation of memory. – mdarsigny Dec 13 '11 at 00:21
  • What kind of buffer pool are you using? Lockfree? – jgauffin Dec 13 '11 at 06:32
  • @mdarsigny Which CPU & how much memory was used to run tests? Did you take Packets/sec metrics, if not, can you tell us about the size of an average packet used? Also, did you guys ever figure out a way to improve performance of C# networking code to match C++? – tunafish24 Apr 03 '14 at 03:42

3 Answers


would anybody know a good article/reference explaining how .NET sockets use I/O completion ports under the hood?

I suspect the only reference would be the implementation (i.e. Reflector or another assembly decompiler). With that you will find that all asynchronous IO goes through an I/O completion port, with callbacks being processed in the IO thread pool (which is separate from the normal thread pool).

use 5 completion ports

I would expect to use a single completion port feeding all the IO into a single pool of threads, with one thread per logical processor servicing completions (assuming you are doing any other IO, including disk, asynchronously as well).

Multiple completion ports would make sense if you have some form of prioritisation going on.

My question is this: should I expect the same performance using C++ sockets and C# sockets?

Yes or no, depending on how narrowly you define the "using ... sockets" part. In terms of the operations from the start of the asynchronous operation until the completion is posted to the completion port I would expect no significant difference (all the processing is in the Win32 API or Windows kernel).

However the safety that the .NET runtime provides will add some overhead. E.g. buffer lengths will be checked, delegates validated, etc. If the application is CPU-limited then this is likely to make a difference, and at the extreme a small difference can easily add up.

Also the .NET version will occasionally pause for GC (.NET 4.5 does asynchronous collection, so this will get better in the future). There are techniques to minimise garbage accumulating (e.g. reuse objects rather than creating them, make use of structs while avoiding boxing).

In the end, if the C++ version works and is meeting your performance needs, why port?

Richard
  • +1 According to Telerik JustDecompile, `SocketAsyncEventArgs` uses `System.Threading.Overlapped`, the implementation of which tells pretty much everything about how I/O completion ports are used. – kol Dec 11 '11 at 17:09
  • Thanks for the detailed answer. I have looked at the Socket code using reflector and the .NET socket uses IO completion ports when doing async operation. This makes me more confused as I would have expected similar performance as my unmanaged code since both of them rely on the same technology... – mdarsigny Dec 13 '11 at 00:26
  • In C++, I use 5 completion ports because I have a thread pool of 5 threads waiting on these ports (1 thread per IO completion port). When data arrives on a port, the thread "wakes up", reads the data, posts it to another worker thread, and then listens on the port again for more data. So this is done 5 times in parallel. – mdarsigny Dec 13 '11 at 00:29
  • In C#, we have stripped our server down to the bare minimum just to validate whether we can receive the same throughput as our C++ application. We wanted to prevent (or rather minimize) GC operations (by pooling a lot of objects). One thing we haven't tried though would be to pin the memory to prevent it from being moved in memory... This is something we will look at. – mdarsigny Dec 13 '11 at 00:32
  • The idea of porting this code was that we had the same need for large incoming UDP throughput in a 100% managed project. We wanted to avoid interop marshalling, so we assumed we could get the same performance in .NET as what we had in C++. Unfortunately, theory and practice were different. I am sure we can achieve the same performance in .NET... We just need to figure out what we are either not doing or doing wrong. – mdarsigny Dec 13 '11 at 00:34
  • @mdarsigny A *single* IO completion port will do that across a number of threads (by default one per logical processor). It will also wake up the most recently used thread, helping maximise cache locality. This fully supports concurrent dispatch of pending IO completions. – Richard Dec 13 '11 at 10:27

You can't do a straight port of the code from C++ to C# and expect the same performance. .NET does a lot more than C++ when it comes to memory management (GC) and making sure that your code is safe (boundary checks etc).

I would allocate one large buffer for all IO operations (for instance 65535 x 500 = 32767500 bytes) and then assign a chunk to each SocketAsyncEventArgs (and for send operations). Memory is cheaper than CPU. Use a buffer manager / factory to provide chunks for all connections and IO operations (Flyweight pattern). Microsoft does this in their Async example.
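A minimal sketch of that buffer-manager idea (names are illustrative, not from any actual implementation): one large allocation is carved into fixed-size chunks, and each SocketAsyncEventArgs checks a chunk out and returns it when recycled. A single large array also spares the GC from tracking hundreds of small buffers that each operation would otherwise pin.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Sockets;

// Sketch of a shared-buffer manager: one big allocation, chunked out
// to SocketAsyncEventArgs instances (Flyweight pattern).
class BufferManager
{
    readonly byte[] _buffer;
    readonly int _chunkSize;
    readonly ConcurrentStack<int> _freeOffsets = new ConcurrentStack<int>();

    public BufferManager(int chunkSize, int chunkCount)
    {
        _chunkSize = chunkSize;
        _buffer = new byte[chunkSize * chunkCount]; // e.g. 65535 x 500
        for (int i = 0; i < chunkCount; i++)
            _freeOffsets.Push(i * chunkSize);
    }

    // Hand a free chunk of the shared buffer to this SocketAsyncEventArgs.
    public bool Assign(SocketAsyncEventArgs args)
    {
        if (!_freeOffsets.TryPop(out int offset))
            return false; // pool exhausted
        args.SetBuffer(_buffer, offset, _chunkSize);
        return true;
    }

    // Return the chunk when the args object is recycled.
    public void Release(SocketAsyncEventArgs args)
    {
        _freeOffsets.Push(args.Offset);
        args.SetBuffer(null, 0, 0);
    }
}
```

A lock-free stack of offsets keeps Assign/Release cheap even when completions land on several IO threads at once.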

Both the Begin/End and the Async methods use IO completion ports in the background. The latter don't need to allocate objects for each operation, which boosts performance.

jgauffin
  • You are right. Our code is not an exact port... That would not work well. The requirements are the same though: a server app that queues "n" UDP socket async reads. The implementation is different in both languages. I like the "jumbo buffer" suggestion. This is something that we will try. – mdarsigny Dec 13 '11 at 00:37

My guess is that you're not seeing the same performance because .NET and C++ are actually doing different things. Your C++ code may not be as safe, or may not check boundaries. Also, are you simply measuring the ability to receive the packets without any processing? Or does your throughput include packet processing time? If so, the code you wrote to process the packets may not be as efficient.

I'd suggest using a profiler to check where the most time is being spent and trying to optimize that. The actual socket code should be quite performant.

Erik Funkenbusch