
I realise that I'll get at least one answer along the lines of "(re)write the code so it doesn't hang" but let's assume we don't live in that shiny happy utopia just yet...

In our embedded system we have a big SDK including a web-server (Boa) which is the primary method of user interaction.

It's possible, during certain phases of the moon, that something can cause the web server to hang or become otherwise stuck in such a way that the process appears running normally (not crashed/dead/using 100% CPU) but does not serve any web pages.

So, the question is, how do we test/detect this situation?

John U
  • Send a small, well-defined query to it and check that you get the expected response within a given time limit? – isedev Oct 01 '14 at 13:16
  • Well, yeah, I sort of guessed the broad strokes would be of that order, what I'm after is the *method* of sending an HTTP request or similar to the web server with minimal footprint. – John U Oct 01 '14 at 16:19
  • Simple solution would be a shell script doing `wget -q -O /dev/null TESTURL` and checking the status code returned by `wget`. – isedev Oct 01 '14 at 16:21

1 Answer


To test whether the server is hung, create a TCP socket and connect to port 80 on the loopback address 127.0.0.1. Then send the following text over the socket:

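GET / HTTP/1.0\r\n\r\n

Most servers will interpret that as a request for index.html. HTTP/1.0 is used here because, unlike HTTP/1.1, it does not require a Host header, and the server closes the connection once the response has been sent. Alternatively, you could implement an undocumented URL for testing (which allows for a shorter, predetermined response), e.g.

GET /test/fdoaoqfaf12491r2h1rfda HTTP/1.0\r\n\r\n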

You then need to read the response from the server. This involves calling select with a reasonable timeout to determine whether any data came back from the server and, if so, calling recv to read it. If select times out before anything arrives, treat the server as hung. The response from the server will consist of a header followed by content. The header consists of lines of text, with a blank line at the end of the header. Lines end with \r\n, so the end of the header is \r\n\r\n.
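As a rough illustration of the header layout described above, here is a minimal Python sketch that splits a received buffer at the first \r\n\r\n and pulls the status code out of the first line. The function name `split_response` is made up for this example, and the parsing is deliberately simplistic (no folded headers, no validation beyond the status line).

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (status_code, header_lines, body).

    Returns None if the blank line that ends the header has not arrived yet.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        return None  # header still incomplete, keep reading
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like: HTTP/1.0 200 OK
    parts = lines[0].split(" ", 2)
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError("malformed status line: %r" % lines[0])
    return int(parts[1]), lines[1:], body
```

For a watchdog it is usually enough to check that the status code is the one you expect; the body only matters if you use a dedicated test URL with a known response.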

Getting the content involves calling select and recv until recv returns 0, which means the server has closed its end of the connection. This assumes that the server will send the response and then close the socket. Some sophisticated servers will leave a socket open to allow multiple requests over the same socket (HTTP keep-alive). A simple embedded server should not be doing that. (If your server is trying to use the same socket for multiple requests, then you need to figure out how to turn that feature off; sending a Connection: close header with the request is the usual way to ask for it.)
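Putting the connect, send, select and recv steps together, a probe could look something like the Python sketch below. It is only an outline under a few assumptions: the name `probe`, the 5 second default and the single overall deadline are my choices, the request is sent as HTTP/1.0 so that the server closes the socket when it is done, and any failure (refused connection, timeout, empty reply) is simply reported as None. The same sequence of calls maps directly onto the C socket API if Python is not available on the target.

```python
import select
import socket
import time


def probe(host="127.0.0.1", port=80, path="/", timeout=5.0):
    """Return the raw response bytes, or None if the server did not answer in time."""
    request = ("GET %s HTTP/1.0\r\n\r\n" % path).encode("ascii")
    deadline = time.monotonic() + timeout
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return None  # connection refused or timed out
    chunks = []
    try:
        sock.sendall(request)
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None  # overall deadline passed
            readable, _, _ = select.select([sock], [], [], remaining)
            if not readable:
                return None  # nothing arrived in time: looks hung
            data = sock.recv(4096)
            if not data:
                break  # recv returned 0 bytes: server closed the socket
            chunks.append(data)
    except OSError:
        return None
    finally:
        sock.close()
    return b"".join(chunks) or None
```

A watchdog would call `probe()` periodically and restart the web server (or the whole box) after one or more consecutive None results.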


That's all very well and good, but you really need to rewrite your code so it doesn't hang.

The most likely cause of the problem is that the server has a bunch of dangling sockets, i.e. connections from clients that were never properly cleaned up. Dangling sockets will eventually prevent the server from accepting more connections, either because the server has a limit on the number of open connections, or because the process that's running the server uses up all of its file descriptors.
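One way to see whether descriptors are leaking is to count the entries under /proc/<pid>/fd for the server process and watch whether the number creeps upward over time. The sketch below assumes a Linux target with /proc mounted and enough permission to read the server's entry; `count_open_fds` is a name made up for this example.

```python
import os


def count_open_fds(pid):
    """Count a process's open file descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))
```

If the count keeps growing while no clients are connected, sockets are being left open somewhere.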

The first thing to check is the TCP timeout value. One project that I worked on had a default timeout of 5 hours, which meant that dangling sockets stayed open for 5 hours. A reasonable timeout is 1 minute.

Then you need to create a client that deliberately misbehaves. Clients can misbehave by

  • leaving a socket open without reading the server's response
  • abruptly closing the socket while reading the response
  • gracefully closing the socket while reading the response

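The first situation should be handled by the TCP timeout. The other two need to be properly handled by the server code. On the client side, graceful versus abrupt socket closure is controlled via the SO_LINGER socket option (set with setsockopt, not ioctl) and the shutdown function: closing a socket with SO_LINGER enabled and a linger time of 0 resets the connection, whereas shutdown followed by close ends it gracefully. After the client misbehaves, check the number of open file descriptors in the server process, to verify that the server has handled the situation correctly.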
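The three kinds of misbehaviour could be scripted roughly as follows. This is a sketch, not a complete test harness: the function names are invented for the example, the request is a plain HTTP/1.0 GET for /, and the SO_LINGER value is packed as two C ints (l_onoff, l_linger), which is the struct linger layout on Linux. To catch the server genuinely mid-response, point the request at a page large enough that it cannot be sent in one go.

```python
import socket
import struct

REQUEST = b"GET / HTTP/1.0\r\n\r\n"


def connect_and_request(host, port):
    """Open a connection and send the request; the caller owns the socket."""
    sock = socket.create_connection((host, port), timeout=5.0)
    sock.sendall(REQUEST)
    return sock


def never_read(host, port):
    """Case 1: send a request, then sit on the open socket without reading."""
    return connect_and_request(host, port)  # keep the returned socket open


def abrupt_close(host, port):
    """Case 2: read a little, then reset the connection (RST instead of FIN)."""
    sock = connect_and_request(host, port)
    sock.recv(16)
    # l_onoff=1, l_linger=0: close() discards pending data and sends a RST
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def graceful_close(host, port):
    """Case 3: read a little, then shut down cleanly (FIN) before closing."""
    sock = connect_and_request(host, port)
    sock.recv(16)
    sock.shutdown(socket.SHUT_RDWR)
    sock.close()
```

Run each case a few hundred times against the device, then compare the server's descriptor count with the value from before the run.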

user3386109
  • Good answer. Ref the 2nd half: It's not my code, it's Boa that's been hacked about by Elbonian Code Slaves as part of a massive hairy and undocumented SDK, the hangs are due to failed calls to other parts of the SDK, we're gradually shaving it and beating it into submission but eradicating *all* the nasty could take man-years of time, so mitigating the effects is our best option as a sticking plaster. – John U Oct 02 '14 at 08:34
  • I hear ya, and you have my condolences. It's amazing how bad some open-source code is, given its apparently widespread use. At some point you have to decide whether it's better to start from scratch, rather than spending man-years unraveling someone else's ball of spaghetti. In my case, I canned the 3rd party server code, and made my own. – user3386109 Oct 02 '14 at 16:56
  • It's worse than that, it's a commercial SDK from Texas Instruments, although they try and distance themselves from the 3rd party Elbonian outfit that produced a lot of it on their behalf. – John U Oct 03 '14 at 10:24