
I'm building a TCP-based daemon for pre-/post-processing of HTTP requests. Clients will connect to Apache HTTPD (or IIS), and a custom Apache/IIS module will forward requests to my TCP daemon for further processing. My daemon will need to scale up (but not out) to handle significant traffic, and most requests will be small and short-lived. The daemon will be built in C++, and must be cross-platform.

I'm currently looking at the Boost.Asio library, which seems like a natural fit. However, I'm having trouble understanding the merits of the stackless-coroutine pattern versus the thread-pool pattern. Specifically, I'm looking at HTTP server example #3 and HTTP server example #4 here: http://www.boost.org/doc/libs/1_49_0/doc/html/boost_asio/examples.html

Despite all of my googling, I'm unable to fully comprehend the merits of the stackless coroutine server, and how it would perform relative to the thread pool server on a multi-core system.

Which of the two is most appropriate given my requirements, and why? Please, feel free to 'dumb down' your answers regarding the stackless coroutine idea, I'm still on shaky ground here. Thanks!

Edit: Another random thought/concern for discussion: Boost HTTP server example #4 is described as "a single-threaded HTTP server implemented using stackless coroutines". OK, so it's entirely single-threaded (right? even after the parent 'forks' a child? see server.cpp in example #4)... will the single thread become a bottleneck on a multi-core system? I'm assuming that any blocking operation will prevent all other requests from executing. If this is indeed the case, then to maximize throughput I'm thinking of a coroutine-based receive-data async event, a thread pool for my internal blocking tasks (to leverage multiple cores), and then an async send & close-connection mechanism. Again, scalability is critical. Any thoughts?
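
To make that concrete, here's a minimal sketch of the hybrid I have in mind, assuming the Boost 1.49-era Asio API (on_request_received, handle_request and send_response are hypothetical stand-ins, and the acceptor/read plumbing is omitted):

```cpp
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>
#include <string>

boost::asio::io_service io;      // network events, driven by one thread
boost::asio::io_service workers; // blocking internal tasks, N threads

void send_response(std::string response)
{
  // runs back on the I/O thread: the async_write + close would go here
}

void handle_request(std::string request)
{
  std::string response = request; // stand-in for the blocking work
  io.post(boost::bind(&send_response, response)); // hop back to the I/O thread
}

void on_request_received(std::string request)
{
  // called on the I/O thread when an async read completes: push the
  // blocking work onto the worker pool and return immediately
  workers.post(boost::bind(&handle_request, request));
}

int main()
{
  // keep both run() loops alive even when momentarily out of work
  boost::asio::io_service::work keep_io(io), keep_workers(workers);

  boost::thread_group pool; // one worker per core for the blocking tasks
  for (unsigned i = 0; i < boost::thread::hardware_concurrency(); ++i)
    pool.create_thread(boost::bind(&boost::asio::io_service::run, &workers));

  io.run(); // the main thread runs the I/O loop (until stopped)
  pool.join_all();
}
```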

Tom
  • I think I understand "to scale up". What is "to scale out"? – Robᵩ May 11 '12 at 14:56
  • Some people find the co-routine approach simpler to read/implement because the code reads from top to bottom. They are a better fit for streaming parsing because you don't have to worry about picking up where you left off once you start consuming the stream again after a break in input. – avid May 11 '12 at 14:59
  • @avid -- thanks, I saw in the boost HTTP server example #4 how they did that with the request parser. It's very nice, no doubt, but I'm more concerned with performance than ease of coding/implementation. What are your thoughts on this? – Tom May 11 '12 at 15:14
  • @Rob - by "scale out", I mean to add additional machines and distribute load across them (which I don't think I'll need; my users will add another web server node with another instance of my app in the background, my app doesn't need to be web-farm-aware) – Tom May 11 '12 at 15:19
  • As you say the purpose of a thread pool is to utilize multiple cores, so this is a somewhat orthogonal concept to coroutines. – Guy Sirton May 12 '12 at 01:47
  • @GuySirton - yes, I'm starting to see that. With this new understanding, I suppose my real question is how boost forking (as demonstrated in HTTP server example #4) relates to threading. What the heck does 'fork' do? Looking at the code (coroutine.hpp), I don't see any thread stuff, so I'm going to assume it's just more coroutine stuff, and therefore within the same thread. – Tom May 12 '12 at 20:28
  • 1
    @TomC: yes. the "fork" in the context of the stackless coroutine sample has nothing to do with threading. It looks like the traditional fork operation but co-routines aren't threads. I haven't used the asio coroutine stuff but I think it's mostly for readability, i.e. to make the event driven code more readable. These aren't really part of boost, they're just in the asio sample, see more here: http://blog.think-async.com/2009/07/wife-says-i-cant-believe-it-works.html – Guy Sirton May 13 '12 at 04:26
  • @TomC: if you look through the sample code, you can see how the connection is handled in a fairly sequential order in one function. The magic glue that makes it work is that the last state is remembered, and when the function is called again a goto jumps to the right line number (see the sketch after these comments). So unlike a traditional asio program, where you'd have special handler functions, you can keep the whole sequence in one function and rely on "reenter" to teleport you to the right point in it. This is my quick impression. You'd need to profile this to see if it really makes any difference to performance. – Guy Sirton May 13 '12 at 04:48
  • @GuySirton: thank you for confirming my suspicions. I've read the think-async entries, but the author presupposes a level of expertise that I did not possess. Looks like I'm just going to have to profile it as you suggest. Thanks again for your help – Tom May 13 '12 at 23:33
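
For reference, the "magic glue" described in the comments above boils down to an ordinary switch statement. A hand-rolled sketch of the same idea (the names are illustrative; the real macros live in the example's coroutine.hpp):

```cpp
// What the reenter/yield macros roughly expand to: a resume point stored
// in a member, plus a switch that jumps back to it on the next call.
// start_read()/process_request()/start_write()/close() are hypothetical stubs.
struct connection
{
  int state_;
  connection() : state_(0) {}

  void resume() // the handler asio invokes on each completion
  {
    switch (state_) // "reenter": jump to wherever we left off
    {
    case 0:
      state_ = 1;        // "yield": remember the resume point...
      start_read();      // ...start the async operation...
      return;            // ...and give the thread back to asio
    case 1:
      process_request();
      state_ = 2;
      start_write();
      return;
    case 2:
      close();
      return;
    }
  }

  void start_read() {}       // stubs; a real server would initiate
  void process_request() {}  // async_read_some/async_write here
  void start_write() {}
  void close() {}
};
```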

2 Answers


I have recently looked at the scalability of boost.asio on multi-core machines. The main conclusion so far is that it does introduce overhead, lock contention and additional context switches (at least on Linux); see some of my blog posts on these topics.

I also started a thread on the asio mailing list to check that I haven't missed anything obvious; see http://comments.gmane.org/gmane.comp.lib.boost.asio.user/5133

If your main concerns are performance and scalability, then I'm afraid there is no clear-cut answer - you might have to do some prototyping and look at the performance.

If you have any blocking operations, then you would definitely want to use multiple threads - on the other hand, context switching and lock contention can decrease performance with multiple threads (at the very least, you will have to be very careful).
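
As a reference point, the usual multi-threaded starting point is the pattern from HTTP server example #3 - a single io_service with a pool of threads all calling run(). A minimal sketch, assuming the Boost 1.49-era API:

```cpp
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

int main()
{
  boost::asio::io_service io_service;
  boost::asio::io_service::work work(io_service); // keep run() from returning

  // ... acceptor and async handlers would be set up here ...

  boost::thread_group pool;
  for (unsigned i = 0; i < boost::thread::hardware_concurrency(); ++i)
    pool.create_thread(
        boost::bind(&boost::asio::io_service::run, &io_service));
  pool.join_all(); // handlers may now run on any thread - use a strand
}                  // (boost::asio::strand) to serialize per-connection state
```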

Edit: just to clarify the stackless coroutines stuff: it's essentially just some syntactic sugar to make the asynchronous API look a bit more like sequential/blocking calls.
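
For illustration, here is roughly what that sugar looks like, modeled on HTTP server example #4 (note that coroutine.hpp and yield.hpp ship with the example, not with Boost proper at 1.49, and the echo logic below is made up for illustration):

```cpp
#include <boost/array.hpp>
#include <boost/asio.hpp>
#include <boost/shared_ptr.hpp>
#include <cstddef>
#include "coroutine.hpp" // from the example directory, not Boost proper
#include "yield.hpp"     // defines the reenter/yield/fork shortcuts

// Echo one read back to the client, written top to bottom instead of as
// a chain of separate completion handlers.
struct session : coroutine
{
  explicit session(boost::shared_ptr<boost::asio::ip::tcp::socket> socket)
    : socket_(socket), buffer_(new boost::array<char, 1024>) {}

  void operator()(boost::system::error_code ec = boost::system::error_code(),
      std::size_t n = 0)
  {
    if (ec) return;
    reenter (this)
    {
      // each "yield" saves the resume point, starts the async operation,
      // and returns; the completion re-invokes operator() past that point
      yield socket_->async_read_some(boost::asio::buffer(*buffer_), *this);
      yield boost::asio::async_write(*socket_,
          boost::asio::buffer(*buffer_, n), *this);
    }
  }

  boost::shared_ptr<boost::asio::ip::tcp::socket> socket_;
  boost::shared_ptr<boost::array<char, 1024> > buffer_;
};
```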

cmeerw
  • Thanks for sharing your expertise, I'll be sure to dig deeper into your blog & sample code as I build my server. Great ASIO thread, too. It appears as though I'll simply have to benchmark a few different implementations and proceed from there. Hey, someone in your ASIO thread asked if you utilized the coroutine functions, but I didn't see a response. Did you plan on profiling this? Thank you for your help! – Tom May 13 '12 at 23:37
  • How'd you make out with your benchmarks and analysis Tom? I'm curious to see where you got to because I'm playing with this too. – Homer6 Jun 30 '12 at 07:45
  • @Homer6 -- I re-architected my app after I confirmed that the boost example #4 was indeed limited to a single core. Even if it wasn't, in my case, it made more sense to run within the web server's thread, rather than creating my own pool. So, at least for now, I have abandoned the idea, and have not pursued it further. Sorry I can't be of more help! – Tom Jun 30 '12 at 19:57

You need to measure to be certain what will actually happen: it is difficult to predict the relative effects of locality of reference, CPU instruction caching, scheduling delays, etc.

If you want a heuristic guess, consider that using n threads with a stack size of S each always reserves nS bytes, however much of that stack space each thread actually uses. If that pushes you across a paging boundary, it could decrease performance measurably.
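
To make the arithmetic concrete: 256 threads with a common 1 MB default stack reserve 256 MB of address space whether they touch it or not. One mitigation is shrinking the per-thread stack; the sketch below uses Boost.Thread's thread::attributes::set_stack_size, which only exists from Boost 1.50 onward (the 1.49 discussed here predates it):

```cpp
#include <boost/thread.hpp>

void pool_worker()
{
  // the thread's actual work loop would go here
}

int main()
{
  // 256 threads x 1 MB default stack = 256 MB reserved, used or not;
  // with 128 KB stacks the same pool reserves only 32 MB.
  boost::thread::attributes attrs;
  attrs.set_stack_size(128 * 1024); // requires Boost >= 1.50
  boost::thread t(attrs, pool_worker);
  t.join();
}
```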

Mike Burrows
  • Thanks Mike, I suppose I'm asking the more theoretical question: are stackless coroutines more performant than thread pools? Or am I wrongly conflating two separate concepts? – Tom May 11 '12 at 15:17
  • @TomC: If you're using Windows, go for the option with the least thread switching. Carefully written coroutines will suffer from less cache thrashing in my experience. If you're blocking for long periods of time, the thrashing is less significant. – JimR May 11 '12 at 15:59
  • @JimR: thanks for the tip. I plan to target both Windows and *nix, so your advice is appreciated – Tom May 11 '12 at 20:15