
On several servers, the "failed connection attempts" counter reported by netstat -s (read from /proc/net/snmp) grows by roughly one per second, and I'd like to diagnose the source of these failures.

Using this iptables rule (on a different server):

-A OUTPUT -p tcp --dport 23 -j REJECT

I am blocking outgoing telnet, so I can run this loop:

while true ; do
    telnet www.google.co.uk
    netstat -s | grep "failed connection"
done

Trying 209.85.203.94...
telnet: Unable to connect to remote host: Connection refused
52 failed connection attempts
Trying 209.85.203.94... telnet: Unable to connect to remote host: Connection refused
53 failed connection attempts
Trying 209.85.203.94... telnet: Unable to connect to remote host: Connection refused
54 failed connection attempts

This shows that the counter is incremented by failed attempts to connect to remote sockets (although it doesn't prove that this is the only cause of increments, of course).

The question is: how can I find the specific remote address and port combination (or combinations) that is failing, so that I can move on to the next step, investigating routing or firewall issues?

As an aside, if I run this:

watch -n1 'ss | grep "\<23\>"'

I was hoping to see sockets in the state SYN-SENT, but don't. Is this because I used REJECT, rather than DROP? Thanks

Graham Nicholls

2 Answers


Let's try to answer the question another (harder) way. Reading the kernel source shows that there is only one place where this metric is incremented: the tcp_done function. As the code shows, the increment happens only for connections in the SYN_SENT or SYN_RECV state. Next, we check where tcp_done can be called from. There are several places:

  1. tcp_reset - called when the connection is aborted (a reply packet with the RST flag is received). Yes, this can happen in the SYN_SENT and SYN_RECV states (and, theoretically, in other states).
  2. tcp_rcv_state_process - called in the TCP_FIN_WAIT1 and TCP_LAST_ACK states, so the metric isn't incremented; this is not our case.
  3. tcp_v4_err - called by the ICMP handler; it can run for a socket in the SYN_SENT or SYN_RECV state.
  4. tcp_time_wait - called when the socket moves into the time-wait or fin-wait-2 state; not our case either.
  5. tcp_write_err - called from several places on timeouts and when the retransmit count is exceeded. It can be a suspect too.

Now open any TCP state machine (FSM) diagram to check in which cases our connection can be in SYN_SENT or SYN_RECV.

In the client case it can only be the SYN_SENT state, where the SYN packet is being transmitted and the connection is aborted either because a reject is received (TCP RST or ICMP error) or because no reply is received.

In the server case it can only be the SYN_RECV state (the SYN has already been received and the SYN+ACK already sent), where the connection is aborted either because a reject is received (the SYN+ACK was rejected somewhere) or because the wait for the reply timed out (no ACK was received).

Now you know why this metric is updated and can check the possible sources in your system. Modern kernels have powerful tools for troubleshooting at the kernel level. Start with this brief tutorial from Brendan Gregg.
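
To watch the counter itself without going through netstat, the value can be read straight from /proc/net/snmp, where it appears as the AttemptFails column of the Tcp rows. The following is a minimal sketch; the function name attempt_fails and its optional file argument are illustrative choices, not part of any tool.

```shell
#!/bin/sh
# Print the TCP AttemptFails counter ("failed connection attempts" in netstat -s).
# /proc/net/snmp holds two "Tcp:" lines: a header row of names, then a row of values.
# The optional first argument (an assumption, handy for testing) names another file.
attempt_fails() {
    awk '
        $1 == "Tcp:" && !col {
            # Header row: remember which column is AttemptFails.
            for (i = 2; i <= NF; i++) if ($i == "AttemptFails") col = i
            next
        }
        $1 == "Tcp:" && col { print $col; exit }
    ' "${1:-/proc/net/snmp}"
}

# Usage: sample the counter twice to get the failures per second.
# before=$(attempt_fails); sleep 1; after=$(attempt_fails)
# echo "$((after - before)) failed attempts in the last second"
```

Sampling the counter like this alongside a packet capture makes it easier to match each increment to a specific reset, ICMP error or timeout.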

Anton Danilov
  • Looking for where the metric is incremented was my next step - I've done the same recently for another metric I was investigating - but it's not quite so simple - you have to check what metric netstat reads to call "failed connection attempts", as the kernel calls it something else. Thank you. – Graham Nicholls Nov 29 '17 at 19:51
  • 1
    The netstat tool reads all metrics from /proc/net/snmp file. You can list the content of this file with the cat. The metrics have short names, but similar with full metric names in netstat output. Ex, the "failed connection attempts" is called as "AttemptFails" in /proc/net/snmp. Next step is find the corresponded identificator in kernel. I've used 'grep "AttemptFails" ./*' in src/linux-kernel-source/net/ipv4/ directory. It has returned "TCP_MIB_ATTEMPTFAILS". Then I've found all calls of TCP_INC_STATS with TCP_MIB_ATTEMPTFAILS and described results in details above. – Anton Danilov Nov 30 '17 at 06:28

One significant source of dropped connections seems to be attempts to connect to non-responsive servers. Remember, we believe that "failed connection attempts" refers to outgoing connections.

Running

ss | awk '$1 ~ /SYN-SENT/ {print $NF}'

10.160.32.211:8312
10.160.33.61:8312
10.160.32.146:8312
10.160.33.216:8312
10.160.34.186:8312
10.160.35.18:8312
10.160.32.157:8312
10.160.33.159:8312
10.160.34.246:8312

shows many connections in this state. Interestingly, they are all attempting to connect to the same port. Picking random IP addresses from that list and trying to connect to port 8312 with telnet times out, e.g.:

$ telnet 10.160.34.246 8312
telnet: connect to address 10.160.32.48: Connection timed out
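
When the list of SYN-SENT peers is long, it helps to tally it by destination port rather than eyeball it. A minimal sketch, assuming the input is one "address:port" string per line as printed by the ss pipeline in this answer (the function name count_by_port is made up for this example):

```shell
#!/bin/sh
# Read peer "address:port" strings on stdin (the last field of each line)
# and print "port count" pairs, most frequent port first.
count_by_port() {
    awk '
        NF {
            n = split($NF, parts, ":")   # the port is the piece after the last colon
            count[parts[n]]++
        }
        END { for (p in count) print p, count[p] }
    ' | sort -k2,2nr -k1,1n
}

# Usage, reusing the pipeline from this answer:
# ss | awk '$1 ~ /SYN-SENT/ {print $NF}' | count_by_port
```

Splitting on the last colon rather than the first keeps the port correct for IPv6 peers as well.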

Sending a SYN packet is the first step in establishing a connection. The other side should respond with a SYN-ACK packet, in which case we respond with an ACK and the connection is established. If, however, there is a firewall between the two servers blocking the connection, then the SYN-ACK will not be forthcoming, so the socket stays in the SYN-SENT state until it times out.

Here's a diagram stolen from lwn.net:

Diagram of TCP 3-way handshake

This timeout is not 2x MSL (the maximum segment lifetime governs the TIME-WAIT state instead). The kernel retransmits the unanswered SYN with exponentially increasing delays, up to net.ipv4.tcp_syn_retries times, and only then fails the connection attempt, so with default settings the socket can typically stay in SYN-SENT for a minute or more.
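
One way to estimate this timeout on Linux: the SYN is retransmitted up to net.ipv4.tcp_syn_retries times, and the wait before each retransmission doubles. The sketch below simply sums those waits. It assumes an initial retransmission timeout of 1 second and plain doubling with no upper cap; both are assumptions, since the initial timeout and the default retry count differ between kernel versions. The function name is made up for this example.

```shell
#!/bin/sh
# Estimate how long connect() waits before giving up on an unanswered SYN.
#   $1 = number of SYN retransmissions (e.g. the value of net.ipv4.tcp_syn_retries)
#   $2 = initial retransmission timeout in seconds (assumed to be 1 if omitted)
syn_connect_timeout() {
    retries=$1
    rto=${2:-1}
    total=0
    i=0
    # One wait after the initial SYN plus one after each retransmission.
    while [ "$i" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))      # exponential backoff
        i=$((i + 1))
    done
    echo "$total"
}

# Usage: syn_connect_timeout "$(sysctl -n net.ipv4.tcp_syn_retries)"
# 6 retries -> 1+2+4+8+16+32+64 = 127 seconds; 5 retries -> 63 seconds.
```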

Now, we need to differentiate between connection attempts where a SYN is sent and nothing returns, and those where a RST is returned. A firewall in the way is normally quite rude; it will drop the original SYN packet silently - it won't send a RST, which is the normal way of letting a client know that there's nothing here.

You can see similar behaviour by trying to connect to www.google.co.uk on a port which you suspect they won't be listening on - eg:

$ telnet www.google.co.uk 32654
Trying 74.125.203.94... telnet: connect to address 74.125.203.94: Connection timed out

Whilst simultaneously running something like this:

while true ; do ss | awk '/SYN-SENT/ && $NF !~ /^10\./' ; sleep 2 ; done
SYN-SENT 0 1 10.137.6.62:46088 74.125.203.94:32654
SYN-SENT 0 1 10.137.6.62:46088 74.125.203.94:32654
SYN-SENT 0 1 10.137.6.62:46088 74.125.203.94:32654

Now, I'm inside a corporate network, and almost certainly access to Google on the normal ports 80/443 is proxied and any other ports are firewalled, so we don't expect to see RST packets. This is why, in the question, I ask about the difference between REJECT and DROP in my iptables rule. DROP simply discards the packet, whereas REJECT sends an error back to the sender: by default an ICMP port-unreachable message, or a TCP RST if the rule specifies --reject-with tcp-reset.

What I'll do next is tcpdump a connection to a non-listening port, and update appropriately.

$ tcpdump -nn -t -i eth0 dst 8.8.8.8
tcpdump: WARNING: eth0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 10.137.6.62.40822 > 8.8.8.8.12345: Flags [S], seq 505811469, win 14600, options [mss 1460,sackOK,TS val 1513647100 ecr 0,nop,wscale 9], length 0
IP 10.137.6.62.40822 > 8.8.8.8.12345: Flags [S], seq 505811469, win 14600, options [mss 1460,sackOK,TS val 1513648100 ecr 0,nop,wscale 9], length 0
IP 10.137.6.62.40822 > 8.8.8.8.12345: Flags [S], seq 505811469, win 14600, options [mss 1460,sackOK,TS val 1513650100 ecr 0,nop,wscale 9], length 0
IP 10.137.6.62.40822 > 8.8.8.8.12345: Flags [S], seq 505811469, win 14600, options [mss 1460,sackOK,TS val 1513654100 ecr 0,nop,wscale 9], length 0
IP 10.137.6.62.40822 > 8.8.8.8.12345: Flags [S], seq 505811469, win 14600, options [mss 1460,sackOK,TS val 1513662100 ecr 0,nop,wscale 9], length 0
IP 10.137.6.62.40822 > 8.8.8.8.12345: Flags [S], seq 505811469, win 14600, options [mss 1460,sackOK,TS val 1513678100 ecr 0,nop,wscale 9], length 0
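
The spacing of the retransmitted SYNs can be read off the TS val field of each line in the capture above. A small sketch follows; the function name syn_gaps is hypothetical, it only looks at lines that carry a TS val option, and the unit is the sender's timestamp tick, which looks like milliseconds here.

```shell
#!/bin/sh
# Read tcpdump lines on stdin and print the difference between the TCP
# timestamp values (TS val) of consecutive lines, one difference per line.
syn_gaps() {
    awk '
        match($0, /TS val [0-9]+/) {
            val = substr($0, RSTART + 7, RLENGTH - 7)   # skip the 7 characters of "TS val "
            if (seen) print val - prev
            prev = val
            seen = 1
        }
    '
}

# Usage (-l makes tcpdump line-buffered so the gaps appear as packets arrive):
# tcpdump -l -nn -t -i eth0 dst 8.8.8.8 | syn_gaps
```

Fed the six SYN lines above, this prints 1000, 2000, 4000, 8000 and 16000: each wait is double the previous one, which is the exponential backoff of SYN retransmission, and no RST or SYN-ACK ever comes back.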

TODO: Add a tcpdump of the case where there is no firewall so we see RST packets.

A caveat: there are many useful sources of information concerning Linux TCP connection debugging. Red Hat is one such source. On one of their pages, they suggest using the dropwatch tool to establish where in the kernel networking stack packets are being dropped. What that page fails to say is that "dropping" packets from a software stack is normal - once a packet has been dealt with, it is dropped. The dropwatch tool makes no distinction between a packet which is dropped because it is finished with, and one which is dropped because of a buffer overflow, or an interrupt budget timeout or ...

Caveat Emptor.

Graham Nicholls
  • I wonder if there's a typo in the port number somewhere - 8312 is a possible mistyping of 3128 - often used for proxy servers. Too late to find out now. – Graham Nicholls Nov 09 '20 at 21:32