
We have two AIX boxes, one for production and another for testing (UAT).

Both systems run ATM switches, and each ATM device connects to the switch over a TCP socket.

We had an issue on the production system: when an ATM powers off or gets disconnected, `netstat -na | grep <IP of machine>` still shows the socket as up.

When we simulated that case in the UAT environment, the problem did not occur: the socket terminated within 3 to 5 minutes.

When we sniffed the traffic between the server and the ATM, we found no traffic at all on production, while on UAT there is some sort of heartbeat that is not initiated by the application.

$>tcpdump | grep -v "10.2.2.71" | grep -v "HSRP" | grep "10.3.1.30"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on en6, link-type 1, capture size 96 bytes
09:08:13.323421 IP server073.afs3-callback > 10.3.1.30.impera: . 278204201:278204202(1) ack 3307884029 win 164
09:08:13.335334 IP 10.3.1.30.impera > server073.afs3-callback: . ack 1 win 64180
09:08:23.425771 IP 10.3.1.30.impera > server073.afs3-callback: . 1:2(1) ack 1 win 64180
09:08:23.425789 IP server073.afs3-callback > 10.3.1.30.impera: . ack 2 win 65535
09:09:13.628985 IP server073.afs3-callback > 10.3.1.30.impera: . 0:1(1) ack 1 win 164
09:09:13.633900 IP 10.3.1.30.impera > server073.afs3-callback: . ack 1 win 64180
09:09:23.373634 IP 10.3.1.30.impera > server073.afs3-callback: . 1:2(1) ack 1 win 64180
09:09:23.373647 IP server073.afs3-callback > 10.3.1.30.impera: . ack 2 win 65535

On production, that traffic is not there.

We want to know where this traffic originates so that we can implement the same thing on production to detect disconnections.

Our comms parameters are:
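To put numbers on the heartbeat in the capture above, here is a minimal Python sketch that measures the gap between successive 1-byte segments from each sender. The function name `probe_intervals` is made up for illustration, and the input is just the four 1-byte segments copied from the tcpdump output; this is a rough reading aid, not a general tcpdump parser.

```python
from datetime import datetime

# The four 1-byte segments copied from the capture above (pure ACKs left out).
CAPTURE = """\
09:08:13.323421 IP server073.afs3-callback > 10.3.1.30.impera: . 278204201:278204202(1) ack 3307884029 win 164
09:08:23.425771 IP 10.3.1.30.impera > server073.afs3-callback: . 1:2(1) ack 1 win 64180
09:09:13.628985 IP server073.afs3-callback > 10.3.1.30.impera: . 0:1(1) ack 1 win 164
09:09:23.373634 IP 10.3.1.30.impera > server073.afs3-callback: . 1:2(1) ack 1 win 64180
"""


def probe_intervals(capture):
    """Return {sender: [seconds between that sender's successive segments]}."""
    last_seen = {}
    intervals = {}
    for line in capture.splitlines():
        fields = line.split()
        # Expect: <time> IP <sender> > <receiver>: ...
        if len(fields) < 5 or fields[1] != "IP":
            continue
        stamp = datetime.strptime(fields[0], "%H:%M:%S.%f")
        sender = fields[2]
        if sender in last_seen:
            gap = (stamp - last_seen[sender]).total_seconds()
            intervals.setdefault(sender, []).append(gap)
        last_seen[sender] = stamp
    return intervals


if __name__ == "__main__":
    for sender, gaps in probe_intervals(CAPTURE).items():
        print(sender, [round(gap, 3) for gap in gaps])
```

On this sample, each side sends a 1-byte segment roughly once a minute, which looks more like TCP keepalive probing than application data.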
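For reference, a heartbeat like this can be requested per socket with the standard `SO_KEEPALIVE` option. The sketch below is only an illustration of that mechanism in Python, not what our switch application does: the helper name `enable_keepalive` and the default values are hypothetical, the per-socket tuning options (`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`) are not available on every platform, and their units (seconds on Linux) should be checked against the platform documentation.

```python
import socket


def enable_keepalive(sock, idle_s=50, interval_s=75, count=2):
    """Ask the TCP stack to probe an idle connection.

    The default values are illustrative assumptions only.
    Returns True if SO_KEEPALIVE is set on the socket afterwards.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning is optional and platform-dependent, so only
    # set the options that this Python build actually exposes.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # time between unanswered probes
        ("TCP_KEEPCNT", count),         # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("keepalive enabled:", enable_keepalive(s))
```

Where the per-socket options are not set, the stack falls back to its system-wide keepalive settings.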

          tcp_keepcnt = 2
         tcp_keepidle = 100
         tcp_keepinit = 150
        tcp_keepintvl = 150
         tcp_finwait2 = 1200

Can anyone help?

Edit 1

One point I missed because I was rushing to a meeting: the difference in setup between production and UAT is that in production we have an F5 working as a load balancer between the ATMs and the AIX box, while in UAT the connection is direct over MPLS.

Note: on UAT we had one MPLS-connected ATM and one GPRS-connected ATM, and both connections terminated about 4 minutes after the ATM was unplugged.

Edit 2

The `no -o tcp_timewait` command returns 1 on both production and UAT.

A.Rashad
  • 1. Avoid using grep when dealing with IPs, [use fgrep instead!](http://unix.stackexchange.com/questions/21020/how-can-i-test-whether-connection-to-the-given-host-port-is-established-in-bash/21060#comment49428_21060) 2. tcpdump is quite powerful tool, so that chain of grep *can be better written using its own bpf syntax* 3. tcpdump's `-w` could write traffic down into a file, so you can look into it later or even share with us its dump. `-n` would get rid of all those awkward port naming. `-X` (can be repeated to increase verbosity) shows content which can pour light upon the matter of the traf – poige Apr 21 '12 at 11:00
  • Sounds like you have TCP keepalive options working on UAT but not on the production host. Are those comms parameters the same on both hosts, for the application sockets in question? Also, what is the tcp_timewait setting? If I read the AIX names correctly, the TCP stack will send a keepalive packet after 100 half-seconds of inactivity, resend after 150 half-seconds, and after 2 unsuccessful probes close the connection. So if the remote host dies, it will take more than 3 minutes for this to be noted and start the FIN_WAIT. Are those times consistent with what you're seeing on the UAT side? Whe – mpez0 May 24 '10 at 11:37
  • OK, let's answer them one by one: 1. I need to check the tcp_timewait on both systems, 13 hours from now. 2. It takes around 3 to 5 minutes according to our rough timing; we did not use a stopwatch, only our sense of time. 3. "Connection up" means ESTABLISHED in the output of `netstat`. 4. We have a load balancer between the production host and the ATM, while there is none in UAT. Thanks for your help. – A.Rashad May 24 '10 at 15:50
  • The F5 box could be sending keepalives itself. – sendmoreinfo Dec 13 '11 at 14:24
