How do I account for messages being broken up when using sockets?

Question

My Design

I'm using sockets to implement a chat server.

The client side uses Java's java.net.Socket and BufferedReader to read messages from the server.

The server side uses PHP's socket_read() to get messages from the clients.

It uses PHP's socket_write() to send messages from the server. socket_write() does not guarantee that the entire original message will be written out, which means I may have to make multiple calls to it to send out the entire original message.
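For illustration, the "keep calling write until everything is sent" pattern looks roughly like this on the Java side, where a channel write may also accept only part of a buffer. This is a minimal sketch, not the server's actual PHP code; the class and method names (WriteFully, writeFully) are made up for the example, and it assumes a blocking channel. A non-blocking server would wait for the socket to become writable instead of looping.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper for this sketch; not part of any library.
final class WriteFully {
    /** Keeps calling write() until every byte of data has been accepted. */
    static void writeFully(WritableByteChannel channel, byte[] data) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            // write() returns how many bytes it took; the buffer's position
            // advances by that amount, so the next call resumes where this one stopped.
            channel.write(buf);
        }
    }
}
```

Sending the whole message this way does not stop the receiver from seeing it in pieces; it only guarantees that all the bytes eventually go out.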

(In terms of design, clients send messages to the server, and server reroutes those messages to the appropriate clients.)

Concerns

My concern is that a message may be broken up into several smaller pieces. So when the server or a client reads incoming data, it may actually be a fragment of the original message.

Questions

Is this something I need to account for? If yes, how?

Possible Solution

Right now I'm thinking about using byte stuffing (which is a networking technique to insert bytes into the original message that serve as flags to mark the start and end of a message before sending it out).

score 2 · Answer 1 · answered Feb 13 '15 at 02:33

If you need application-level messages, then you have to implement them at application level. There are several common approaches:

1) Use fixed-length messages.

2) Prefix each message with its length.

3) Use an 'end of message' marker that naturally never appears in your messages.

4) Use an 'end of message' marker and escape it if it appears in your messages.
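As a rough sketch of option 4 (which is essentially the byte stuffing mentioned in the question), the Java below escapes the marker on the way out and undoes the escaping on the way in. The marker and escape bytes, and the names EscapedFraming, encode and Decoder, are choices made for this example only. The decoder keeps state between calls, so it does not matter how the stream gets chopped up in transit.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical framing helper for this sketch; not part of any library.
final class EscapedFraming {
    static final byte END = 0x0A; // '\n' marks the end of a message (assumed choice)
    static final byte ESC = 0x5C; // '\\' precedes a literal END or ESC inside a payload

    /** Escapes END and ESC in the payload, then appends one unescaped END. */
    static byte[] encode(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : payload) {
            if (b == END || b == ESC) {
                out.write(ESC);
            }
            out.write(b);
        }
        out.write(END);
        return out.toByteArray();
    }

    /** Stateful decoder: feed it whatever bytes arrive, in chunks of any size. */
    static final class Decoder {
        private final ByteArrayOutputStream current = new ByteArrayOutputStream();
        private boolean escaped = false;

        /** Consumes chunk[0..len) and returns every message completed by it. */
        List<byte[]> feed(byte[] chunk, int len) {
            List<byte[]> complete = new ArrayList<>();
            for (int i = 0; i < len; i++) {
                byte b = chunk[i];
                if (escaped) {            // previous byte was ESC: take this one literally
                    current.write(b);
                    escaped = false;
                } else if (b == ESC) {
                    escaped = true;
                } else if (b == END) {    // unescaped marker: message is complete
                    complete.add(current.toByteArray());
                    current.reset();
                } else {
                    current.write(b);
                }
            }
            return complete;
        }
    }
}
```

A read loop would call feed(buffer, n) with the buffer and byte count from each read and handle whatever complete messages come back, which may be none, one, or several.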

score 1 · Accepted Answer · answered Feb 13 '15 at 02:54

Yes, this is something you need to handle in your protocol.

The two most typical approaches here are:

1) Make your protocol line-oriented. Terminate every message with a newline, and don't treat a line as complete until you see that newline character. This, of course, depends on newlines not naturally appearing in messages.

   Some protocols which use this approach include SMTP, IMAP, and IRC.

2) Include the length of the message in its header, so that you know how much data to read.

   Some protocols which use this approach include HTTP (in the Content-Length header) and TLS, as well as many low-level protocols such as IP.
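For the line-oriented approach, the Java client from the question already has most of what it needs: BufferedReader.readLine() does not return until it has seen a line terminator (or the stream ends), so a message that arrives in several pieces is reassembled for you. A minimal sketch, assuming UTF-8 text and with LineFraming, sendLine and reader being names made up for this example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for this sketch; not part of any library.
final class LineFraming {
    /** Sends one message followed by '\n'; rejects text that would break the framing. */
    static void sendLine(OutputStream out, String message) throws IOException {
        if (message.indexOf('\n') >= 0 || message.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("message must not contain line breaks");
        }
        out.write((message + "\n").getBytes(StandardCharsets.UTF_8));
        out.flush();
    }

    /** Wraps the socket's input stream once; reuse the same reader for every message. */
    static BufferedReader reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

On the client you would create the reader once from socket.getInputStream() and call readLine() in a loop; a null return means the server closed the connection. The server has to follow the same rule and append the newline to everything it sends.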

If you aren't sure which approach to take, the second one is considerably easier to implement, and doesn't place any restrictions on what data you use it with. A simple implementation might simply store the count of bytes as a packed integer, and could look like the following pseudocode:

send_data(dat):
    send(length of dat as packed integer)
    send(dat)

recv_data():
    size = recv(size of packed integer)
    return recv(size)

(This code assumes that the abstract send() and recv() methods will block until the entire message is sent or received. Your code will, of course, have to make this work appropriately.)
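In Java, the pseudocode above could be sketched like this. It assumes a 4-byte big-endian length prefix, which is what DataOutputStream.writeInt() produces and what PHP's pack('N', ...) would produce on the server side. The names LengthPrefixedFraming, sendData and recvData, and the size cap, are made up for this example.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper for this sketch; not part of any library.
final class LengthPrefixedFraming {
    // Arbitrary cap so a corrupt or hostile length cannot force a huge allocation.
    static final int MAX_MESSAGE_BYTES = 1 << 20;

    /** Writes a 4-byte big-endian length, then the payload itself. */
    static void sendData(DataOutputStream out, byte[] data) throws IOException {
        out.writeInt(data.length);
        out.write(data);
        out.flush();
    }

    /** Reads one complete message, however many pieces it arrives in. */
    static byte[] recvData(DataInputStream in) throws IOException {
        int size = in.readInt(); // blocks for all 4 bytes; EOFException if the peer closed
        if (size < 0 || size > MAX_MESSAGE_BYTES) {
            throw new IOException("bad message length: " + size);
        }
        byte[] data = new byte[size];
        in.readFully(data);      // loops internally until exactly size bytes are read
        return data;
    }
}
```

readInt() and readFully() are what supply the "block until the entire message is received" behaviour; a plain InputStream.read(byte[]) may return fewer bytes than requested.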

I'm so glad I asked this. I took a networking course and didn't even consider storing the count in the header because it's possible that the count may be corrupted in the link layer. I completely overlooked that TCP handles corrupted bytes for me so storing the count in an application level message is 100% viable. I was about to work a little harder than I had to so thank you :) — Kacy, Feb 13 '15 at 03:16
Corruption isn't the issue here. The real problem is that the OS may reframe TCP packets as it sees fit — depending on circumstances, one `send()` may result in multiple packets, multiple `send()`s may result in a single packet, and so on. If you want to break a TCP stream into a series of distinct messages, you need to draw the lines yourself. — , Feb 13 '15 at 03:23
