I have a small web app written in Perl, running under mod_perl in Apache. All it does is open a socket connection to a server and wait for an OK message before it sends a request. We have a maximum of 10 children. At random times, reading this OK message fails, while other reads at the same time are fine. I have strace'd it with

sudo strace -x -o traceout.log -f -tt -s 1024 -p 23735

Normal reads look like this:

31317 14:27:18.043630 alarm(30)         = 0
31317 14:27:18.043722 read(17, "OK nGSrv ready. $Revision: 1.59 $  Built: May  5 2017 11:17:19 - [1]\r\n", 4096) = 70
31317 14:27:18.043811 alarm(0)          = 30

but failed reads look like this:

31198 14:26:34.350791 alarm(30)         = 0
31198 14:26:34.350836 fstat64(16, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
31198 14:26:34.350934 read(16, "OK nGSrv ready. $Revision: 1.59 $  Built: May  5 2017 11:17:19 - [3]\r\n", 4096) = 70
31198 14:26:34.351014 read(16,  <unfinished ...>
:
:
31198 14:26:39.345766 <... read resumed> "", 4096) = 0
31198 14:26:39.345829 alarm(0)          = 25

The second read returns 0 (EOF) after 5 seconds because the other end closes the connection when it has not received a request; that is why alarm(0) reports 25 of the 30 seconds remaining.

Does anyone know why the failures have this extra fstat64 and the unfinished read?

$server = IO::Socket::INET->new(Proto    => "tcp",
                                PeerAddr => $ip,
                                PeerPort => $port,
                                Timeout  => $timeout);
if ($server) {
    eval {
        local $SIG{ALRM} = sub { die "alarm\n" };  # \n required!
        alarm $queuetimeout;
        $greeting = <$server>;   # buffered readline on the socket
        alarm 0;
    };
    if ($@) {  # Something in the eval died
        if ($@ eq "alarm\n") {
            $greeting = 'BUSY';   # Timeout: force a BUSY response
        } else {
            $greeting = 'ERROR';  # Unexpected error: force an ERROR response
        }
    }
    # (rest of the handler, which sends the request, is not shown)
}

The line that produces the 'extra' fstat64 and the unfinished read is:

$greeting = <$server>;

This works fine until some random event causes 1 in 10 reads to fail for a random period of time, after which the failures stop. This affects 6 web servers, all on the same network, with common mounts and a common DB. The only difference we can find between good and bad reads is this fstat64. These 6 web servers (Apache, mod_perl) connect to one of 2 other servers, also on the same network. We have tcpdump'ed both ends and can see that the "OK nGSrv ready..." message is sent immediately and received immediately, but somehow (only during these random periods) it is not read fully/correctly by the client.
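For comparison, here is a minimal sketch of reading the greeting with sysread instead of `<$server>`, so the result does not depend on PerlIO buffering or on the global `$/` (which persists across requests under mod_perl). Whether either of those is involved in the failures is an assumption, not something the trace confirms. The helper name `read_greeting` and the socketpair stand-in for the real server are hypothetical and only there to make the example self-contained.

```perl
use strict;
use warnings;
use Socket;

# Hypothetical helper: read one "\n"-terminated line with sysread,
# independent of the global $/ and of PerlIO buffering.
sub read_greeting {
    my ($sock, $timeout) = @_;
    my $line = '';
    my $ok = eval {
        local $SIG{ALRM} = sub { die "alarm\n" };   # \n required
        alarm $timeout;
        until ($line =~ /\n\z/) {
            my $n = sysread($sock, $line, 4096, length $line);
            unless (defined $n) {
                next if $!{EINTR};          # interrupted by a signal: retry
                die "read error: $!\n";
            }
            last if $n == 0;                # peer closed the connection
        }
        alarm 0;
        1;
    };
    alarm 0;                                # never leave an alarm pending
    return $line if $ok;
    return $@ eq "alarm\n" ? 'BUSY' : 'ERROR';
}

# Stand-in for the real server: a connected pair of local sockets.
socketpair(my $client, my $srv, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";
syswrite($srv, "OK nGSrv ready.\r\n");
{
    local $/ = undef;   # simulate other code having changed $/
    print read_greeting($client, 5);
}
```

If the failures disappear with a read like this, that would point at the readline layer rather than at the network, which would match what tcpdump shows.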

MArk W.
  • Please include some of the Perl code you're using to read from the socket. – Sep 19 '18 at 18:00
  • I'm pretty sure that means the process received a signal. This interrupted the `read` so the signal could be handled, after which the `read` was resumed. – ikegami Sep 20 '18 at 13:07
