
I am trying to implement a highly scalable server in java for the following usecase

  1. Client sends a request to server in a form of COMMAND PARAM

  2. Server can send a varying size response, from a few characters (10 bytes) to large text data (of size 6-8gb, equivalent to client ram)

What should be the appropriate way to send response in these scenarios. I need to support multiple concurrent clients. Can some one point me to a reference/sample implementation?

Mark Rotteveel
ankit Soni
    At some point you need to just write the data to the stream. But you need to explain what you mean by "non-blocking TCP client server design". Not blocking what? Client side? Server side? Threads? What kind of server is this? HTTP? Raw TCP? What kind of scalability? Bear in mind that a single server solution is by definition not scalable. – Stephen C Dec 02 '20 at 02:01
    A request for a sample / reference implementation is off-topic. (If one exists, you should be able to find it using Google, etc.) – Stephen C Dec 02 '20 at 02:02
  • @Stephen, This is in the context of a raw TCP server design for a single node. What I mean by scalability is: the design needs to support multiple concurrent clients, maybe 100s or 1000s depending on server resources. – ankit Soni Dec 02 '20 at 04:27
    You should put all of the relevant details into your Question. Use the edit button. – Stephen C Dec 02 '20 at 04:34

1 Answer


There are many existing solutions that use NIO under the hood, such as Netty and frameworks like Grizzly. Getting NIO right is extremely complicated, borderline rocket science. Use those frameworks.

highly scalable server in java

NIO is usually slower. The primary benefit of NIO is that you can hand-roll your buffers, vs. threaded models where the stack size is locked in: you configure the stack size for all stacks once, as you start java itself (java -Xss1m, for example, for a 1MB stack), meaning 1,000 threads require 1GB of memory just for the stacks, let alone the heap.

Usually, tossing a big RAM stick at your box is many orders of magnitude more effective.

NIO shines when ALL of the following things are true:

  • You need to deal with many simultaneous connections, but not a great many (a great many cannot be handled by one computer no matter how efficiently you write it; the solution then is sharding and a distributed design. google.com does not run on one computer and never could, nor does twitter; that is an example of simultaneous-connection requirements that exceed where NIO is useful).
  • NIO means you can be forced to 'swap out' at any time, for example right in the middle of parsing a command. That means you need buffers to store state. The state you need to store needs to be small, or NIO is not very useful.
  • The task that needs doing needs to not be CPU bound. If it is, NIO is just going to make things slower.
  • The task that needs doing needs to not be blocking-bound. For example, if, as part of the job of handling a connection, you need to connect to a database, unless you go out of your way to find a way to do so in a non-blocking way, you can't use NIO. NIO requires that you never block a thread for any reason. This means callback hell. You may want to look that up.
  • The performance benefit is so important that it is worth complicating the development of your app by an order of magnitude to accommodate it.

That leaves only a tiny window where full NIO is advisable. And that tiny window of exotic applications where it makes sense will very soon be even smaller, because Project Loom is hopefully heading for a preview release in java 17 (could be as soon as 9 months from now or so), further reducing whatever gains you could make happen with NIO.

The general setup of NIO works like so:

  • First, you make X threads, where X is about twice the number of cores you have.
  • Then, you define a bunch of job handlers; each handler is an object that maintains the state of a job to do, in your case, one represents waiting for incoming socket connections, and for each open connection you'd also have a job object.
  • Each thread tries to manage all jobs simultaneously. This works by making asynchronous channels for each job. For each channel, you register what is 'interesting' (what would imply you can do work without waiting for I/O). Then, you go to sleep, asking java to wake up your thread if anything interesting happens on any job. You then broker with the other threads which one is going to handle any particular job that needs doing, and does it.
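A bare-bones sketch of that setup in plain Java NIO (no framework), with the per-connection job handling elided; the port number and the single-threaded loop are illustrative only:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorSkeleton {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        // One standing job: waiting for incoming connections.
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select(); // sleep until something 'interesting' happens on any job
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    // A new job per connection, initially interested in incoming data.
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    // read from (SocketChannel) key.channel(); buffer partial commands here
                }
            }
        }
    }
}
```

A real multi-threaded version additionally has to broker which thread handles which selected key, which is where much of the complexity lives.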

The state of 'I am interested in' is in continuous fluctuation.

For example, in the beginning, any connection is interested in 'if data comes in'. But then when data comes in, maybe only 'HEL' comes through (the complete command is 'HELLO\n', for example); you'd have to remember that and go right back to the loop. Once the full HELLO command is in, you want to write 'HEY THERE!' back, but when calling the send on that channel, only 'HEY T' is sent so far. You then want to stop being interested in 'data is available' and start being interested in 'you can send data now'. You did not want to be interested in that before, because then your thread is continuously woken up (you can send! you can send! you can send! you can send!), resulting in your fans spinning up and everything becoming slow as molasses.

Once you sent the full HEY THERE!, you want to stop being interested in 'can send' again, and start being interested in 'data available' again.
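That interest-flag toggling boils down to calls like these; a toy helper for illustration, where `key` is the SelectionKey of one connection:

```java
import java.nio.channels.SelectionKey;

final class InterestOps {
    // Once the full command ('HELLO\n') is parsed and a response is queued:
    static void wantToSend(SelectionKey key) {
        key.interestOps(SelectionKey.OP_WRITE); // wake me up when I can send
    }

    // Once the full 'HEY THERE!' has been flushed out:
    static void wantToReceive(SelectionKey key) {
        key.interestOps(SelectionKey.OP_READ); // back to waiting for incoming data
    }
}
```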

Juggling the brokering between threads and the interest flags on your channels is very complicated. You also get the fun of this tidbit:

If you ever block on anything, your app is broken, but you won't know. It'll just be really inefficient and slow but only once multiple connections start coming in. This is really hard to test, and a very easy trap to fall into. No exceptions will occur, for example. You'll also end up in callback hell.

Which gets us back to: it's rocket science. Use Netty or Grizzly.

EDIT: Your specific use case

As per a comment on this question, you want to write a server that will handle requests along the lines of 'RANGE 100000000-100002000', which is to be interpreted as: Send me the bytes at the stated index range from some large file.

I don't think NIO is useful for you in that case. But it might be. You'd design the system roughly like so:

You would have to use NIO2 to do your disk access asynchronously. Here is a tutorial for an introduction to this. If it's all in memory, that's much simpler.
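A minimal sketch of what an asynchronous read via NIO2's AsynchronousFileChannel looks like; the file name and the offset are placeholders for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AsyncRangeRead {
    public static void main(String[] args) throws Exception {
        // "bigfile.dat" is a placeholder; any readable file works.
        AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                Path.of("bigfile.dat"), StandardOpenOption.READ);

        ByteBuffer buf = ByteBuffer.allocate(2000);
        long offset = 100_000_000L; // e.g. the start index of a RANGE request

        // read() returns immediately; the handler runs later on a pool thread.
        ch.read(buf, offset, buf, new CompletionHandler<Integer, ByteBuffer>() {
            @Override public void completed(Integer bytesRead, ByteBuffer b) {
                // bytesRead can be less than requested, or -1 if past end-of-file;
                // hand b off to the socket-writing side here.
            }
            @Override public void failed(Throwable exc, ByteBuffer b) {
                exc.printStackTrace();
            }
        });
        // A real server keeps the channel open until all handlers have fired.
    }
}
```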

Even if you use async NIO2 file access, you're just spinning your wheels and not making things any faster if the underlying disk isn't both very fast and entirely random access.

If you do this to a platter disk, hoo boy. Horrible performance.

The thing is, incoming requests ask for a consecutive block of bytes. Spinning disks can serve that much, much faster than a bunch of small reads scattered around the disk: reading all over the disk requires the head to hop around, which is very slow.

The big risk here is that by aggressively NIO2-izing your disk access, you effectively turn your program into 'hops around the disk' mode, and performance gets worse, not better.

The better alternative is probably something simple, like this:

  • You have a threadpool. Maybe as low as 50 threads, maybe as many as 1000.
  • This is nowhere near enough to really make the VM break out in a sweat.
  • Your serverSocket.accept() loop ('accept a socket, make the handler object, hand it to the pool to process') will wait for a free thread in the pool; thus, if no thread is available, accept calls are effectively stoppered up for a bit. That's good.

Effectively, then, your app will handle the first X (X = pool size) calls simultaneously and will then let the phone ring for a bit, so to speak, until one is done. This is probably what you want anyway: aggressively parallelized disk access isn't the fastest way to read a disk.
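A sketch of that simple alternative; the pool size and the rejection policy are illustrative choices. With a SynchronousQueue plus CallerRunsPolicy, a saturated pool makes the accepting thread run the handler itself, which stoppers up further accepts exactly as described:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PooledServer {
    public static void main(String[] args) throws IOException {
        // Bounded pool of 50 threads; when all are busy, CallerRunsPolicy
        // makes accept()'s own thread run the handler, delaying the next accept.
        ExecutorService pool = new ThreadPoolExecutor(
                50, 50, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());

        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket socket = server.accept();
                pool.execute(() -> handle(socket));
            }
        }
    }

    static void handle(Socket socket) {
        try (socket) {
            // parse "RANGE 100000000-100002000", read the bytes, write them back...
        } catch (IOException e) {
            // log and drop the connection
        }
    }
}
```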

If you really want to know, you're going to have to write this app twice and compare the two.

rzwitserloot
  • rzwitserloot, Thanks for your detailed reply. This helps in building some understanding. Request to clarify one of your points - The task that needs doing needs to not be blocking-bound (involves a DB connection). --- For example, if the server is intended to serve the contents (let's say a specific line) from a single large file: multiple client requests are expected to hit the server asking for specific parts of the file. Do you think NIO will not help here? What are the general recommendations to build a TCP server for this kind of usecase? – ankit Soni Dec 02 '20 at 04:41
  • answer updated to reflect on your clarification request. – rzwitserloot Dec 02 '20 at 12:54
  • rzwitserloot, thanks for your explanation. If this is an HDD (magnetic disk), will disk IO (read, in this case) be sequential only if multiple concurrent threads (a threadpool) are used? Also trying to use MemoryMappedBuffer..do you see any concern with using it in this case? – ankit Soni Dec 02 '20 at 13:21
  • It's sequential if you just.. create a RandomAccessFile object, skip ahead to the right place, and read.. and _DO NOT USE_ threads. Just have one that serially handles requests, that's probably fastest. It sounds nuts, but the disk is the bottleneck, and thus optimizing for the bottle neck at the cost of all other things is nevertheless the right move. The good news is, that's very simple code to write :) - involving threads or async at all will __make things slower__ for magnetic disks. – rzwitserloot Dec 02 '20 at 13:47
  • rzwitserloot, thanks for your input. In this usecase, a different size of response needs to be sent to clients for each request, depending on the request (for example, from a few bytes to MBs to GBs, as large as RAM can handle). What is a good mechanism to handle this in code? Would a chunked-response kind of mechanism be ideal, or do you recommend another way? thanks in advance.. – ankit Soni Dec 02 '20 at 17:25
  • Chunks are always a good idea. note that you want a single 'disk reader' thread, but multiple socket threads (which wait for the disk reader to be available, and you'll need to cook up some sort of brokerage algorithm: Who gets priority? The one waiting longer, or the one with the smaller request?) - after all, you don't want the system to slow down because someone with a slow network connection connects. – rzwitserloot Dec 02 '20 at 21:21
  • rzwitserloot, another approach: before we start serving client requests, we split a big large file into multiple smaller file chunks (i.e. 100gb -> 5gb chunks, 20 chunk files in total), but all chunks will be on the same magnetic disk. So IMO, at a time a single thread will be able to do the disk IO (read data from a single chunk file, not from many files in parallel) - is this a correct understanding? So will this (splitting the large file into small files, plus prior book-keeping of the file chunks - start-end offsets for quick lookup) be a better way compared to reading a single large file? – ankit Soni Dec 03 '20 at 04:47
  • No, there is no meaningful difference between having a single 100GB file vs 100 1GB files, or even 10,000 10MB files. The single file would be faster, if anything. – rzwitserloot Dec 03 '20 at 12:54
  • rzwitserloot, thanks for your comment. I have one more doubt regarding your earlier comment - "Chunks are always a good idea. note that you want a single 'disk reader' thread, but multiple socket threads". Do you mean to have some queue mechanism in between the multiple writer threads and the single reader thread, something like writers keep adding their requests to a queue and the reader takes and processes/serves them one by one? How can I ensure that the reader thread, being single, does not get starved and always gets CPU? pls help. – ankit Soni Dec 03 '20 at 13:27
  • Yes. You can adjust the priority with `.setPriority` if you like, though I doubt it'll be necessary. – rzwitserloot Dec 03 '20 at 13:49
  • @rzwitserloot: You should add part of your comments to the answer, to keep the insights they offer from being "buried" in the comments. – Peter O. Feb 06 '21 at 17:14
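The 'single disk reader thread, multiple socket threads' design discussed in these comments could be sketched like this (all names are hypothetical; Java 16+ for records). Socket threads enqueue byte-range requests and block on a future; one reader thread serves them strictly one at a time, so every read stays a single consecutive sweep of the disk:

```java
import java.io.RandomAccessFile;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

public class DiskReader {
    // A read request: byte range plus a future the socket thread waits on.
    record ReadRequest(long start, int length, CompletableFuture<byte[]> result) {}

    private final BlockingQueue<ReadRequest> queue = new ArrayBlockingQueue<>(100);

    public void start(String path) {
        Thread reader = new Thread(() -> {
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                while (true) {
                    ReadRequest req = queue.take(); // blocks; the thread never starves for CPU,
                                                    // it simply sleeps when the queue is empty
                    byte[] chunk = new byte[req.length()];
                    file.seek(req.start());         // one consecutive read per request
                    file.readFully(chunk);
                    req.result().complete(chunk);
                }
            } catch (Exception e) {
                // a real server would fail all pending requests here
            }
        });
        reader.setDaemon(true);
        reader.start();
    }

    // Called from socket threads: enqueue and wait for the disk reader.
    public byte[] read(long start, int length) throws Exception {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        queue.put(new ReadRequest(start, length, result));
        return result.get();
    }
}
```

The queue is also the natural place to plug in a smarter brokerage policy (smallest request first, or longest waiting first) instead of plain FIFO.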