
We have a Linux application (we don't have the source) that seems to be hanging. The socket between the two processes is reported as ESTABLISHED, and there is some data in the kernel socket buffers (although nowhere near the 16M configured via wmem/rmem). Both ends of the socket appear to be stuck in sendto().

Below is some investigation using netstat/lsof and strace:

HOST A (10.152.20.28)

[root@hosta ~]# lsof -n -u df01 | grep 12959 | grep 12u
q         12959 df01   12u  IPv4            4398449                TCP 10.152.20.28:38521->10.152.20.29:gsigatekeeper (ESTABLISHED)

[root@hosta ~]# netstat -anp | grep 38521
tcp   268754  90712 10.152.20.28:38521          10.152.20.29:2119           ESTABLISHED 12959/q

[root@hosta ~]# strace -p 12959
Process 12959 attached - interrupt to quit
sendto(12, "sometext\0somecode\0More\0exJKsss"..., 542, 0, NULL, 0 <unfinished ...>
Process 12959 detached
[root@hosta~]#

HOST B (10.152.20.29)

[root@hostb ~]# netstat -anp | grep 38521
tcp    72858 110472 10.152.20.29:2119           10.152.20.28:38521          ESTABLISHED 25512/q

[root@hostb ~]# lsof -n -u df01 | grep 38521
q         25512 df01   14u  IPv4            6456715                 TCP 10.152.20.29:gsigatekeeper->10.152.20.28:38521 (ESTABLISHED)

[root@hostb ~]# strace -p 25512
Process 25512 attached - interrupt to quit
sendto(14, "\0\10\0\0\0Owner\0sym\0Type\0Ctpy\0Time\0Lo"..., 207, 0, NULL, 0 <unfinished ...>
Process 25512 detached
[root@hostb~]#

We have upgraded the NIC driver to the latest version. The systems are running RHEL 5.6 x64 (2.6.18-238.el5). I have checked the errata for RHEL 5.7 and 5.8, but I can see no mention of bugs in the bnx2 driver or the kernel.

Does anyone have any ideas of how to debug this further?

The_Viper
  • Your system is fine, but your application is broken. Trash it. The programmer is most likely doing synchronous I/O in an unsafe way, leading to this deadlock. – BatchyX Sep 15 '11 at 10:39
  • Generally the available space is half of the configured wmem/rmem, often less. – nos Sep 15 '11 at 11:35

1 Answer


Is either side actually reading? If not, both sides' receive buffers may be full. Each peer then stops transmitting (because the other side's receive window is closed), both send buffers fill up in turn, and sendto blocks. (This could happen despite your wmem/rmem settings if the application is setting the SO_RCVBUF and SO_SNDBUF socket options.)
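The stall described above can be reproduced in miniature. The following is a minimal Python sketch, not the behaviour of your application: it assumes a TCP connection over loopback, and the helper names (`tcp_pair`, `fill_until_stalled`, `drain`) are made up for illustration. One end keeps sending while the other never reads; once the kernel stops accepting data, a blocking sendto() would hang in the same way as in the strace output.

```python
import socket
import time


def tcp_pair():
    """Return a connected (sender, receiver) TCP pair over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()
    return sender, receiver


def fill_until_stalled(sock, chunk=b"x" * 4096, quiet=0.5):
    """Send while the peer never reads; return the bytes the kernel accepted.

    The socket is non-blocking, so a full send buffer raises BlockingIOError
    instead of hanging. Once that persists for `quiet` seconds the connection
    is treated as stalled; a blocking sendto() would be stuck at this point.
    """
    sock.setblocking(False)
    total = 0
    stalled_since = None
    while True:
        try:
            total += sock.send(chunk)
            stalled_since = None
        except BlockingIOError:
            now = time.monotonic()
            if stalled_since is None:
                stalled_since = now
            elif now - stalled_since >= quiet:
                return total
            time.sleep(0.01)


def drain(sock, expected, timeout=5.0):
    """Read `expected` bytes, which lets the sender make progress again."""
    sock.settimeout(timeout)
    received = 0
    while received < expected:
        data = sock.recv(65536)
        if not data:
            break
        received += len(data)
    return received


if __name__ == "__main__":
    sender, receiver = tcp_pair()
    queued = fill_until_stalled(sender)
    print("send stalled after", queued, "bytes with no reader")
    print("reader drained", drain(receiver, queued), "bytes")
```

If both ends behave like the sender here and neither ever calls recv, neither can make progress, which matches the symptoms in the question. The number of bytes accepted before the stall depends on the kernel's buffer sizing, so it will vary between systems.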

To debug this, I'd synchronize both machines' clocks, then run both applications under strace with the -e trace=network and -tt options, so you can compare the logs and see whether the application is reading at all.
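Comparing the two logs by eye gets tedious, so here is a rough Python sketch that interleaves them by timestamp. It assumes each relevant log line starts with the HH:MM:SS.microseconds timestamp that -tt produces, with no PID prefix (i.e. a single traced process without -f), and that the capture does not cross midnight. The function names and the "A"/"B" host labels are made up for illustration.

```python
import re

# Matches lines such as "HH:MM:SS.uuuuuu syscall(...)".
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{6})\s+(.*)$")


def parse(lines, host):
    """Return (microseconds_since_midnight, host, text) for each timestamped line."""
    events = []
    for line in lines:
        match = TIMESTAMP.match(line.strip())
        if not match:
            continue  # skip attach/detach banners and other untimed lines
        hours, minutes, seconds, micros, text = match.groups()
        stamp = ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1_000_000
        events.append((stamp + int(micros), host, text))
    return events


def merge(log_a, log_b):
    """Interleave two logs in timestamp order, tagging each event with its host."""
    return sorted(parse(log_a, "A") + parse(log_b, "B"))


def show(events):
    """Print the merged timeline, one event per line."""
    for stamp, host, text in events:
        print(f"{stamp / 1_000_000:12.6f} {host} {text}")
```

Calling `show(merge(open("hosta.log"), open("hostb.log")))` (the file names are placeholders) prints one timeline. If the last recvfrom on each side comes well before the final sendto, that supports the theory that neither side is reading.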

You could also use a network analyzer (such as Wireshark) to determine if the TCP receive window gets stuck on 0.

If this is the case, you could probably work around this by creating a small caching proxy, which would recv/send from both sides, buffering whatever can't be sent at the time.
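As a rough illustration of that workaround, here is a minimal Python sketch of the buffering loop such a proxy would run. It is an assumption-laden outline rather than a finished tool: the `relay` function and its arguments are hypothetical, it handles a single pair of already-connected sockets, it buffers without any upper bound, and it stops at the first EOF without flushing what is still queued.

```python
import select
import socket
import threading


def relay(left, right, stop):
    """Shuttle bytes between two connected sockets until `stop` is set.

    Data that cannot be written yet is queued in memory, so neither
    endpoint is left blocked in send while its peer is also sending.
    """
    left.setblocking(False)
    right.setblocking(False)
    # pending[s] holds bytes waiting to be written to socket s.
    pending = {left: bytearray(), right: bytearray()}
    peer = {left: right, right: left}
    while not stop.is_set():
        # Only ask about writability for sockets that have queued data.
        want_write = [s for s in pending if pending[s]]
        readable, writable, _ = select.select([left, right], want_write, [], 0.1)
        for s in readable:
            try:
                data = s.recv(65536)
            except BlockingIOError:
                continue
            if not data:
                return  # one side closed; a real proxy would flush first
            pending[peer[s]] += data
        for s in writable:
            try:
                sent = s.send(pending[s])
            except BlockingIOError:
                continue
            del pending[s][:sent]
```

In a real deployment the proxy would listen on the port the client expects, connect onwards to the real server, and run a loop like this per connection, for example with `threading.Thread(target=relay, args=(client_side, server_side, threading.Event()))`. The unbounded queue is the point of the workaround, but it also means memory grows for as long as neither application reads.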

Hasturkun
  • Neither side is reading. It's a single-threaded application, and both are stuck on sendto(). I have to ^C out of the strace output above. I didn't know about the network tracing function of strace; this is definitely where I will be looking next. I have already checked, again using strace, whether SO_RCVBUF/SO_SNDBUF is being set when the socket is created, and it isn't. The rmem/wmem settings are set to 16M, and I can get the Recv-Q/Send-Q nearly full to this amount using netperf, like: tcp 0 15570344 10.152.20.28:57385 10.152.20.29:40366 ESTABLISHED 4875/netperf – The_Viper Sep 15 '11 at 12:27
  • I will also get a pcap of the traffic and look out for the window size. My understanding of window scaling is that it helps when two hosts are at either end of an LFN (long fat network), resulting in a large BDP (data in flight). Would it be outrageous to disable window scaling on a host when it is communicating with other hosts on the same LAN? – The_Viper Sep 15 '11 at 12:34
  • I ended up using SystemTap and the wonderful pfiles stap script to troubleshoot this further. Highly recommended: http://sourceware.org/systemtap/wiki/WSPfiles – The_Viper Sep 21 '11 at 09:39
  • @The_Viper: I'd love to hear what the problem was if you find a solution – Hasturkun Sep 21 '11 at 10:22