
There is no public data on actual IOCP performance. Does anyone know how well IOCP scales on, say, 128 cores - does it reach 1 million ops/sec on good hardware?

Since the IOCP queue is itself a shared resource that all connections' threads must access, at some point contention on the queue should limit overall IOCP scalability. Are there any benchmarks on this?

Dmitry Sychov
  • Do you think that creating several *iocp*, binding files to the different *iocp*, and creating a separate thread pool for each *iocp* would be faster? And is the IOCP really the only shared resource between threads? Isn't the heap (allocation/free) in user space, and the kernel pool, a shared resource too? How well does IOCP scale? I don't know, but I don't think multiple *iocp* will beat a single one. What matters more is how frequently packets are queued to the *iocp* (i.e. how often an I/O operation finishes). The system simply pops some waiting thread at that point, and it makes no difference whether 1 or 128 threads are waiting on it. – RbMm Jun 13 '20 at 19:12
  • Thanks for the comment. Yes, I think a hybrid architecture with about 4..8..16 CPU cores per iocp is better than using one iocp for hundreds of CPU threads, or one iocp per CPU thread (the nginx model, though nginx obviously uses epoll on Linux). Surprisingly, no one to date has managed to benchmark iocp properly on a multi-core CPU. One iocp for all threads = too much contention, but one iocp per thread is bad for load balancing. – Dmitry Sychov Jun 13 '20 at 21:56
  • I think the performance of an *iocp* does not depend on the number of threads waiting on it at all; it depends on the number of packets queued to it. A proper test is hard - many factors can play a role. And the *iocp* is not the only shared resource: what about the heap? If it makes sense for your logic, you could create multiple heaps and bind each thread to its own heap, allocating only from it, etc. However, it's hard to give a definitive answer about which solution is better. – RbMm Jun 13 '20 at 22:08
  • ***one iocp for all threads = too much contention*** - this is 100% wrong. An *iocp* is not a critical section; there is no contention here at all - it is a different kind of object. Any number of threads can wait on an *iocp* without any contention: they are simply inserted into the *iocp*'s wait list, and that's all - no contention, no CPU usage, no additional memory, etc. When a packet arrives at the iocp, the system pops a thread from the wait list (how fast this is does not depend on the number of waiting threads) and wakes it. Contention here is not per thread; it is **per packet**. – RbMm Jun 13 '20 at 22:14
  • My opinion: it doesn't matter how many cores and threads you have. What matters is how frequently packets are queued to the iocp. A stress test should be driven by packet counts per unit of time, not by thread or CPU count. – RbMm Jun 13 '20 at 22:18
  • Too much contention from cache invalidation/thrashing; sometimes a single variable write-shared across several hundred threads is enough to limit performance in the general case... – Dmitry Sychov Jun 13 '20 at 23:05
  • I don't think so. Which concrete cache invalidation/thrashing do you mean? Any number of threads can wait on an *iocp*, and there is no contention here: they simply wait and consume no CPU time. When a packet lands in the iocp, the system selects one thread from the waiters and hands the packet to it; the other threads keep waiting. The situation depends only on the packet rate. – RbMm Jun 13 '20 at 23:36
  • For NUMA, for example, multiple iocps (one per node) seem a better choice than having all CPU sockets serve a single iocp; a 4-socket Intel configuration = 4 iocps with thread affinity. By the way, I've failed to find any benchmark showing iocp serving at least a million requests/sec; if you know of one, please let me know. For Linux the best I've found is this: https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/ I fear that the innate performance of the Windows network stack is poor compared to what it could be (in theory). – Dmitry Sychov Jun 14 '20 at 17:09
  • It seems to me that your question is framed incorrectly. The issue is not the core/thread count but how many packets are pushed to the iocp per unit of time, and how fast threads can handle those packets. For example, if a thread can handle a packet faster than the next one is pushed, a single thread is enough for the iocp; all the other threads will wait forever and play no role. If a new packet is pushed every ~t interval, and a thread needs ~T time to handle a packet, you need roughly T/t threads. The other issue is that we usually cannot know t and T, so we tend to allocate more threads. – RbMm Jun 14 '20 at 17:43
  • So the question should be based not on CPU count but on this: **if packets are pushed to the iocp very frequently, does it make sense to use multiple iocps?** I think not, simply because there are many other shared resources that you need to access from all threads anyway - for example, some shared state that you access with exclusive or shared access while handling a request; the heaps; really too many. – RbMm Jun 14 '20 at 17:43
  • And it's impossible to give a general answer or run a general test - the result can depend heavily on what your threads do while handling a request. However, I think the result won't visibly change either way if you use multiple *iocp* and thread pools. It's not hard to allocate several pools via `CreateThreadpool`; every thread pool maintains its own I/O completion port. Then you can bind different files to different pools/iocps and test whether your concrete code gets faster or not. Also, as far as I can see, Windows already takes the NUMA situation into account here if you use the system thread pool and iocp rather than your own. – RbMm Jun 14 '20 at 17:50
  • For example: https://pastebin.com/cM2XXWjJ - a very primitive demo that creates multiple pools/iocps (`CreateThreadpoolMinMax(4, 8)`) and assigns files to the different *iocp* sequentially (`pptpp[n & (_countof(pptpp) - 1)]`). But this is only a demo of how to create multiple iocps and pools; unfortunately I cannot run real tests on high-load servers :) – RbMm Jun 14 '20 at 18:01

0 Answers