
Most of the time my rsh cycle works fine, and we get the following logs from rshd:

Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)
Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root@172.17.0.40 as root: cmd='echo 481'

In some error cases, however, the rsh succeeds but only after a delay of several seconds; see the timestamps below:

Aug 19 04:12:24 shmm500 authpriv.info in.rshd[17968]: connect from 172.17.0.40 (172.17.0.40) 
Aug 19 04:12:27 shmm500 auth.info rshd[17972]: root@172.17.0.40 as root: cmd='echo 18'

I also noticed that in most normal cases the PID increases by 1 between the two log lines, while in most of the delayed cases it increases by 4 (see the PIDs in the logs above), so it seems rshd forks some extra processes. Can anyone explain why rshd takes these extra seconds and why the PID jumps?

We use the old rsh, not ssh. I'm not sure, but it seems to be the rsh from netkit. This is an embedded board running busybox, with no strace/pstack. On the client side I just run 'rsh 172.17.0.8 pwd'; no hostname is used.

Qiu Yangfan
  • Often this is because in.rshd is trying to do a reverse DNS lookup to get the host name of the remote system, and the DNS request is timing out. You could try `strace -f -p pid-of-the-in.rshd-daemon` just before running the `rsh` command and see what the `in.rshd` process does just before it pauses. – Mark Plotnick Dec 23 '14 at 14:41
  • On the client side I just run 'rsh 172.17.0.8 pwd'; no hostname is used. I'm not sure whether a DNS lookup is triggered in my case. And there's no strace on my board. – Qiu Yangfan Dec 24 '14 at 00:49
  • The delay is likely on the server side rather than the client side. Look at the list of hosts in the `.rhosts` file of the user on the server. If any host is not in `/etc/hosts`, then the authentication mechanism will probably do a DNS lookup, which may take a few seconds, for example if a DNS server isn't reachable. – Mark Plotnick Dec 24 '14 at 04:22
  • @Mark Plotnick: I checked; that is not my case. – Qiu Yangfan Dec 24 '14 at 08:43

1 Answer


Answering my own question:

This issue was caused by frame loss. Occasionally either the SYN or the SYN+ACK of the TCP three-way handshake was dropped for some reason, so the connecting side did not receive the SYN+ACK within the 3-second timeout (this timeout is hardcoded in the Linux kernel). connect() then resent the SYN, which usually succeeded on the second try.

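From the application's point of view, this shows up as a 3-second delay, or longer if the retry is lost as well: the retransmission interval doubles, so the next attempt comes 6 seconds after the first retry.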
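To make the timing concrete, here is a minimal sketch of how the stall grows with the number of lost handshake packets. It assumes the 3-second initial timeout mentioned above and that the timeout doubles after every unanswered SYN (ordinary TCP exponential backoff); the function name `connect_delay` is made up for illustration, and the kernel's upper bound on the timeout is ignored.

```python
def connect_delay(lost_syns, initial_rto=3.0):
    """Seconds an application waits in connect() before the SYN that
    finally gets answered is sent.

    Assumptions (not taken from the logs): the first timeout is
    `initial_rto` seconds and it doubles after each unanswered SYN.
    """
    delay = 0.0
    rto = initial_rto
    for _ in range(lost_syns):
        delay += rto   # wait out the current timeout
        rto *= 2       # back off before the next retransmission
    return delay


for lost in range(4):
    print(f"{lost} lost SYN(s): about {connect_delay(lost):.0f} s of extra delay")
```

With one lost packet this gives the 3-second gap seen in the slow log above; kernels that use a different initial timeout will show a different gap.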

Other relevant information:

The first log line comes from tcpd (aka TCP wrapper):

Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)

The second log line comes from rshd in netkit 0.17:

Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root@172.17.0.40 as root: cmd='echo 481'
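Since the tcpd line and the rshd line bracket the slow step, the gap between them can be measured straight from syslog. The following is only a rough sketch: the helper names `parse` and `session_gaps` are invented here, the year is an arbitrary placeholder because syslog timestamps carry none, month names are assumed to be in English, and sessions are assumed not to interleave.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from ...
LINE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ \S+ (?:in\.)?rshd\[(\d+)\]: (.*)$"
)


def parse(line, year=2000):
    """Return (timestamp, pid, message), or None for unrelated lines."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    stamp, pid, msg = m.groups()
    # Placeholder year: only differences between timestamps are used.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, int(pid), msg


def session_gaps(lines):
    """Pair each 'connect from' line with the next 'cmd=' line and
    return a list of (seconds_between, pid_difference) tuples."""
    gaps = []
    pending = None
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        when, pid, msg = rec
        if msg.startswith("connect from"):
            pending = (when, pid)
        elif "cmd=" in msg and pending is not None:
            gaps.append(((when - pending[0]).total_seconds(), pid - pending[1]))
            pending = None
    return gaps


log = [
    "Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)",
    "Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root@172.17.0.40 as root: cmd='echo 481'",
    "Aug 19 04:12:24 shmm500 authpriv.info in.rshd[17968]: connect from 172.17.0.40 (172.17.0.40)",
    "Aug 19 04:12:27 shmm500 auth.info rshd[17972]: root@172.17.0.40 as root: cmd='echo 18'",
]
print(session_gaps(log))  # [(0.0, 1), (3.0, 4)]
```

A gap of about 3 seconds between the two lines is the signature of the slow sessions.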

rsh needs two TCP connections: the first goes from the rsh client to rshd, and the second goes from rshd back to the rsh client, which means rshd acts as the TCP client for it. My issue was frame loss on this second connection.
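The connect-back pattern can be illustrated with a toy loopback demo. This is a simplified sketch, not the netkit code: it uses unprivileged ephemeral ports on 127.0.0.1 instead of port 514 and reserved source ports, it skips the user names and command that the real protocol sends, and the names `toy_rshd` and `demo` are hypothetical. It only shows that the client announces a port as a NUL-terminated ASCII number and that the daemon then calls connect() toward the client.

```python
import socket
import threading


def read_until_nul(conn):
    """Read bytes from conn up to (not including) the first NUL byte."""
    buf = bytearray()
    while True:
        b = conn.recv(1)
        if not b or b == b"\0":
            return bytes(buf)
        buf += b


def toy_rshd(listener, result):
    """Toy daemon: accept the primary connection, then connect BACK."""
    conn, (client_ip, _) = listener.accept()
    with conn:
        stderr_port = int(read_until_nul(conn))
        # Here the daemon is the TCP client. A lost SYN or SYN+ACK on this
        # connect() stalls the session until the SYN is retransmitted.
        with socket.create_connection((client_ip, stderr_port), timeout=5) as back:
            back.sendall(b"stderr channel up\n")
        conn.sendall(b"\0")  # toy acknowledgement on the primary stream
        result["stderr_port"] = stderr_port


def demo():
    result = {}
    with socket.socket() as srv, socket.socket() as err_listener:
        srv.bind(("127.0.0.1", 0))
        srv.listen(1)
        err_listener.bind(("127.0.0.1", 0))
        err_listener.listen(1)
        err_listener.settimeout(5)
        t = threading.Thread(target=toy_rshd, args=(srv, result))
        t.start()
        with socket.create_connection(srv.getsockname(), timeout=5) as primary:
            port = err_listener.getsockname()[1]
            primary.sendall(str(port).encode() + b"\0")  # announce the port
            back, _ = err_listener.accept()              # daemon connects back
            with back:
                msg = b""
                while True:
                    chunk = back.recv(64)
                    if not chunk:
                        break
                    msg += chunk
            ack = primary.recv(1)
        t.join()
    return port, result["stderr_port"], msg, ack


if __name__ == "__main__":
    print(demo())
```

In the real setup the second connection crosses the network in the opposite direction from the first, so loss that affects only that direction would hit it while the primary connection still looks healthy.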

Qiu Yangfan