7

I know the term "Load Balancing" can be very broad, but the subject I'm trying to explain is more specific, and I don't know the proper terminology. What I'm building is a set of Server/Client applications. The server needs to handle a massive amount of data transfer as well as a large number of client connections, so I started looking into multi-threading.

There are essentially three ways I can see to implement any sort of threading for the server...

  • One thread handling all requests (defeats the purpose of threading if 500 clients are logged in)
  • One thread per user (risky, since that means creating one thread for each of the 500 clients)
  • A pool of threads which divides the work evenly across any number of clients (what I'm seeking)

The third one is what I'd like to know more about. It consists of a setup like this:

  • A maximum of 250 threads running at once
  • 500 clients will not create 500 threads, but will share the 250
  • A queue of pending requests waits to be handed to a thread
  • A thread is not tied to a client, and vice versa
  • The server decides which thread to send a request to, based on activity (load balancing)

I'm not seeking any code quite yet, but rather information on how a setup like this works, and preferably a tutorial on accomplishing it in Delphi (XE2). Even the proper word or name for this subject would be sufficient, so I can do the searching myself.

EDIT

I found it necessary to explain a little about what this will be used for. I will be streaming both commands and images; there will be a double-socket setup, with one "Main Command Socket" and another "Add-on Image Streaming Socket". So really, one connection is two socket connections.

Each connection to the server's main socket creates (or re-uses) an object representing all the data needed for that connection, including threads, images, settings, etc. For every connection to the main socket, a streaming socket is also connected. It's not always streaming images, but the command socket is always ready.

The point is that I already have a threading mechanism in my current setup (one thread per session object) and I'd like to shift that over to a pool-like multithreading environment. The two connections together require higher-level control over these threads, and I can't rely on something like Indy to keep them synchronized; I'd rather know how things work than learn to trust something else to do the work for me.
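
To make the above a bit more concrete, here is roughly the shape I have in mind. This is just a sketch with placeholder names (TClientSession, TRequest, TServerPool are all made up), and I'm using TThreadedQueue here only as a stand-in for whatever the real pending-request queue ends up being:

```delphi
unit SessionPoolSketch;
// Rough sketch of the shape I have in mind - placeholder names, no real logic.

interface

uses
  System.SysUtils, System.Generics.Collections;

type
  TClientSession = class
  public
    // One object per main-socket connection; re-used when a client reconnects.
    CommandSocket: TObject;  // the "Main Command Socket"
    StreamSocket: TObject;   // the "Add-on Image Streaming Socket"
  end;

  TRequest = class
  public
    Session: TClientSession;  // which connection the work belongs to
    Payload: TBytes;          // a command or a chunk of image data
  end;

  TServerPool = class
  private
    FQueue: TThreadedQueue<TRequest>;  // pending requests, shared by all workers
  public
    constructor Create(AMaxThreads: Integer);
    // Any idle worker picks up the next request; no thread is tied to a client.
    procedure QueueRequest(ARequest: TRequest);
  end;

implementation

constructor TServerPool.Create(AMaxThreads: Integer);
begin
  inherited Create;
  FQueue := TThreadedQueue<TRequest>.Create(1000);  // room for pending requests
  // ...spin up AMaxThreads (e.g. 250) worker threads, each looping on FQueue.PopItem...
end;

procedure TServerPool.QueueRequest(ARequest: TRequest);
begin
  FQueue.PushItem(ARequest);  // whichever worker is free next will handle it
end;

end.
```

How the 250 workers and that queue should actually be wired together is what I'm asking about.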

Jerry Dodge
  • +1 very interesting question, I'm interested in what others would have to say about this... –  Feb 27 '12 at 20:51
  • Very cool question, but also very vague... You have found the interesting area where I would both upvote you and vote to close. This is not a systems-design-concepts-discussion website, it's a programming answers website. I would call this "Thread Pooling" for what it's worth, not "Load Balancing" which usually implies a multi-computer system (load is balanced between two separate physical boxes). – Warren P Feb 27 '12 at 21:42
  • @Warren maybe the OP didn't use the "right" words, but I, for one, am very interested in a "good" design for solving this issue. I find it funny that I was also going to ask something similar, but then I thought I'd give it a shot first and then come and ask questions... However, Jerry's problem is big: you can't really spawn a new thread for each user; that's about 500 * ~2MB/thread => 1GB of RAM just to handle connections... –  Feb 27 '12 at 22:14
  • This is really more of a design question, rather than a programming question. I think this would be a better fit at [programmers](http://programmers.stackexchange.com). – Ken White Feb 27 '12 at 22:25
  • @DorinDuminica - reducing the default stack size to something more reasonable, 128K say, is a reasonable approach. 500 threads is not that many, really, assuming only a small fraction are busy at any one time. – Martin James Feb 27 '12 at 23:05
  • @Martin 500 threads in a 32 bit process is going to lead to address space fragmentation. – David Heffernan Feb 27 '12 at 23:36
  • The issue with large numbers of threads is not address space, it is the OS overhead of swapping threads in and out of execution. Indy handles 300-600 concurrent TCP connections (I have tested this), but then it hits a limit with thread-swapping, and CPU usage by the threads drops dramatically as all the time is consumed by the OS in task swapping. – Misha Feb 28 '12 at 00:11
  • "Thread-Swapping" - Is this the term I'm looking for? – Jerry Dodge Feb 28 '12 at 00:26
  • @Misha - if most of the threads are not running, there is not much to swap. The default stack space should be set lower than the default 1MB (128K is more than reasonable) to prevent excessive virtual address space use. – Martin James Feb 28 '12 at 00:45
  • @Martin, yes, but if you use Indy you need to poll for data for each server connection, so the threads have to run occasionally. The time between "polls" determines how "responsive" the system is. 300 connections each on a poll time of 100 ms works OK. Bump up these limits too much and the system dies. – Misha Feb 28 '12 at 00:53
  • My real point is that there are fundamental limits with every system. The key is whether you are going to reach them or not. Bad architectural design is always predicated on "what ifs" without some concrete appraisal of the likelihood of the circumstances arising. Indy works well for up to around 500 concurrent clients - more than that and you need a new architecture (or splitting your server across separate machines). – Misha Feb 28 '12 at 00:56
  • OK, I tried it. My box now has 2126 threads 'running' (i.e. idle, like a connected Indy client that is not doing anything). No issue with CPU load - it's still at 1/2% like it was before I loaded on an extra 1000 threads (they're all waiting on a semaphore). – Martin James Feb 28 '12 at 00:58
  • @Misha - poll? What poll? Indy server-client threads usually perform blocking reads - no polling. – Martin James Feb 28 '12 at 01:01
  • Wow, I wasn't expecting anyone to do any test applications to answer my question, although I appreciate it :D I added an edit to my question explaining more about what it's for and why I don't want to use Indy. – Jerry Dodge Feb 28 '12 at 01:05
  • @Martin, so if you have to asynchronously send notification data down the socket you are stuffed. Unless you have another thread for sending - and then you have two threads for each socket. For full bi-directional data transfer at any time, you have to read data with a timeout of 0, i.e. read any data currently buffered, so that you can send data at any time. – Misha Feb 28 '12 at 01:05
  • New record - 3114 threads, no problemo, though the test app that runs up the 2000 extra threads takes ages to start up and shut down! I set the stack in the linker options to 131072. – Martin James Feb 28 '12 at 01:07
  • @Jerry, I use Indy and have complete control over task processing and pooling. The Indy threads are just used to get data in and out of the socket. All "processing" is done in separate threads. If you look at Indy as just your communications mechanism then it is obvious that Indy places no restrictions on how you process your data. – Misha Feb 28 '12 at 01:08
  • @Misha - OK, if you need to send asynchronous data from the server, then that's another matter - the one thread really should wait on both the socket and a queue semaphore so as to prevent polling. Still, even without a queue wait, there is no need to wake the receive thread to send stuff on the socket. One thread could do all the sending, if it's handed a queued object containing the socket reference and the data to send. I accept that this is not ideal since one socket send might block and prevent sending to all the others. A write-thread pool would be safer. – Martin James Feb 28 '12 at 01:13
  • I'm giving up now. 5003 threads, no problem. Yes, I saved all my work first. Run outside the debugger, the app takes less than a second to load up 4000 threads! – Martin James Feb 28 '12 at 01:25
  • @MartinJames What are the specs of the computer you're running this on? – Jerry Dodge Feb 28 '12 at 01:35
  • I could just make use of OTL and dunk each process in the pool and wait for its response... But then again, I'm trying to wrap this into a component and I don't want to require another 3rd party library... – Jerry Dodge Feb 28 '12 at 01:42
  • @JerryDodge - OK, it's an overclocked i7 with 12G of RAM and an SSD, but even so, the RAM use on the Task Manager 'Performance' tab only goes up from 3.44GB to 3.67GB when I add the extra 4000 threads. – Martin James Feb 28 '12 at 02:03
  • Nice... When I tried to spawn 1,700 threads once, it miserably failed and I had to restart my computer :( – Jerry Dodge Feb 28 '12 at 02:05
  • @JerryDodge - you have to reduce the maximum stack size which, by default, is 1MB. I set it to 128K. – Martin James Feb 28 '12 at 10:05

3 Answers

4

IOCP server. It's the only high-performance solution. It's essentially asynchronous in user mode ('overlapped I/O' in M$-speak): a pool of threads issues WSARecv, WSASend, and AcceptEx calls and then they all wait on an IOCP queue for completion records. When something useful happens, a kernel thread pool performs the actual I/O and then queues up the completion records.

You need at least a buffer class and a socket class (and probably others for high performance: objectPool and pooledObject classes so you can make socket and buffer pools).
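
For a rough idea of the worker side, a minimal sketch of one pool thread waiting on the completion port might look like this. The class and field names are made up; the port/socket setup, per-operation context records, buffer and socket pools, and error handling are all omitted:

```delphi
unit IOCPWorkerSketch;
// Minimal sketch of one IOCP pool thread. Assumes the completion port was
// created elsewhere with CreateIoCompletionPort and that each accepted
// socket has been associated with it.

interface

uses
  Winapi.Windows, System.Classes;

type
  TIOCPWorker = class(TThread)
  private
    FPort: THandle;  // completion port shared by every worker in the pool
  protected
    procedure Execute; override;
  public
    constructor Create(APort: THandle);
  end;

implementation

constructor TIOCPWorker.Create(APort: THandle);
begin
  FPort := APort;
  inherited Create(False);
end;

procedure TIOCPWorker.Execute;
var
  Bytes: DWORD;
  Key: ULONG_PTR;
  Ovl: POverlapped;
begin
  while not Terminated do
  begin
    // Sleep until the kernel posts a completion record for *any* socket
    // associated with the port - no thread is tied to a connection.
    if GetQueuedCompletionStatus(FPort, Bytes, Key, Ovl, INFINITE) then
    begin
      if Ovl = nil then
        Break;  // nil packet posted via PostQueuedCompletionStatus = shutdown
      // Key and Ovl identify the connection and the finished WSARecv /
      // WSASend / AcceptEx; process the buffer, then issue the next
      // overlapped call so the socket always has I/O pending.
    end
    else if Ovl = nil then
      Break;  // the completion port was closed
  end;
end;

end.
```

The point is that none of these threads belongs to a connection; whichever worker dequeues the completion record services it and then re-issues the next overlapped call for that socket.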

Martin James
  • I'm really not sure that IOCP is what he's looking for. – Warren P Feb 27 '12 at 21:43
  • @WarrenP - me neither, but he is asking for 'a massive amount of data transfer' and mentions 500 clients and 250 threads. I must admit, I've never tried to attach 250 work threads to an IOCP queue (usually 16/32), but I've no reason to believe that it wouldn't work. – Martin James Feb 27 '12 at 23:15
3

500 threads may not be an issue on a server-class computer. A blocking TCP thread doesn't do much while it's waiting on the socket.

There's nothing stopping you from creating some type of work queue on the server side, served by a limited size pool of threads. A simple thread-safe TList works great as a queue, and you can easily put a message handler on each server thread for notifications.
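
As a rough illustration of that idea (just a sketch; TRequest, TWorkQueue, and the other names are placeholders), such a queue might look like this:

```delphi
unit WorkQueueSketch;
// Sketch of a thread-safe work queue served by a fixed-size pool of threads.
// TRequest stands in for whatever carries a unit of work.

interface

uses
  System.Classes, System.SyncObjs;

type
  TRequest = class
  public
    Data: Pointer;  // placeholder for the real payload
  end;

  TWorkQueue = class
  private
    FItems: TThreadList;  // thread-safe list used as a FIFO
    FWakeUp: TEvent;      // signalled whenever work is queued
  public
    constructor Create;
    destructor Destroy; override;
    procedure Push(ARequest: TRequest);
    function Pop: TRequest;           // nil if the queue is empty
    property WakeUp: TEvent read FWakeUp;
  end;

implementation

constructor TWorkQueue.Create;
begin
  inherited Create;
  FItems := TThreadList.Create;
  FWakeUp := TEvent.Create(nil, False, False, '');  // auto-reset event
end;

destructor TWorkQueue.Destroy;
begin
  FWakeUp.Free;
  FItems.Free;
  inherited;
end;

procedure TWorkQueue.Push(ARequest: TRequest);
begin
  FItems.Add(ARequest);  // TThreadList.Add locks internally
  FWakeUp.SetEvent;      // wake one waiting worker thread
end;

function TWorkQueue.Pop: TRequest;
var
  List: TList;
begin
  Result := nil;
  List := FItems.LockList;
  try
    if List.Count > 0 then
    begin
      Result := TRequest(List[0]);
      List.Delete(0);
    end;
  finally
    FItems.UnlockList;
  end;
end;

end.
```

Each thread in the pool then just loops: wait on WakeUp, Pop until it returns nil, and process each request. The number of pool threads caps how many requests are being handled at once, and notifications back to the server threads can go through their message handlers.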

Still, at some point you may have too much work, or too many threads, for the server to handle. This is usually handled by adding another application server.

To ensure scalability, code for the idea of multiple servers, and you can keep scaling by adding hardware.

There may be some reason to limit the number of actual work threads, such as limiting lock contention on a database, or something similar, however, in general, you distribute work by adding threads, and let the hardware (CPU, redirector, switch, NAS, etc.) schedule the load.

Marcus Adams
  • +1 Thanks for the info; most of that I'm already doing actually. I don't necessarily intend for this to always run on a server-class computer, although presumably it should. I am interested in the idea of two servers actually, because this one server I'm building really contains 2 server sockets. The only question is whether it would be suitable to build two servers and transfer image data between the two. These will be Service Apps, btw. It just seems like adding yet another layer around it to slow the image xfer down. – Jerry Dodge Feb 28 '12 at 00:32
2

Your implementation is completely tied to the communications components you use. If you use Indy, or anything based on Indy, it is one thread per connection - period! There is no way to change this. Indy will scale to hundreds of connections, but not thousands. Your best hope for using thread pools with your communications components is IOCP, but here your choices are limited by the lack of third-party components. I have done all this investigation before, and you can see my question at stackoverflow.com/questions/7150093/scalable-delphi-tcp-server-implementation.

I have a fully working distributed development framework (threading and comms) that has been used in production for over 3 years now across more than a half-dozen separate systems and basically covers everything you have asked so far. The code can be found on the web as well.

Misha
  • I actually dropped Indy just prior to asking this, for that exact reason; it's why I'm asking this question. 500 clients on Indy would indeed kill it due to 500 threads. – Jerry Dodge Feb 27 '12 at 23:00
  • 500 threads is not that many! According to the Task Manager, I have 1129 threads 'running' on my box now. Of course, most of them are not actually running, just like 1129 connected clients with most protocols. – Martin James Feb 27 '12 at 23:11
  • @Jerry, that's a fairly premature design decision! The usability of Indy is an order of magnitude better than any Delphi IOCP implementation, or one that you develop yourself. Unless you are absolutely sure that you are going to exceed 500 concurrent connections by some margin, dropping Indy because of this is not a well thought-out decision. – Misha Feb 28 '12 at 00:08
  • @DorinDuminica - yes it is, but why do you think that matters? If anything, 1000 threads running the same code in one app will use fewer resources than 1000 threads spread over many apps. – Martin James Feb 28 '12 at 00:37
  • @Misha I have other reasons why I got away from Indy... It's a great suite, but for my purposes, it's a little overkill and I'd like to do the thread management myself. – Jerry Dodge Feb 28 '12 at 00:51
  • @Jerry, the best designs use as little "self-plumbing" as possible, unless you are into developing frameworks, which is a whole different ball game. You should ask yourself whether you want to develop a product or a framework, because you don't want to do both! – Misha Feb 28 '12 at 01:00
  • Chrome is throwing fits again; I know I upvoted this and posted a comment, but somehow it didn't get sent to SO. So... +1 Looks like a great thing to look into, thanks. – Jerry Dodge Feb 28 '12 at 01:15
  • One thread per connection <> one thread per user, unless you have clients that keep their connections alive all the time. – Marjan Venema Feb 28 '12 at 08:11