
UPDATE AT BOTTOM

I’m using a Red Hat Enterprise Linux Server release 7.4 (Maipo) VM in my OS class of about 20 students who generally launch about two ssh connections to this machine with their own specific user ids. This seems to work fine as students trickle into the classroom.

However, at the start of class when most students try to log in, I have students who are unable to log into the system with an “ssh: connect to host xxx.xxx.xxx.xxx port 22: Connection refused” message. Waiting 20 minutes or so seems to eventually let some more people in. sshd is definitely running. The set of users being refused varies, and sometimes also includes me. I might have connected via ssh successfully a few minutes before, but then can't start a second session.

All of our outgoing traffic uses a Many-to-1 NAT setup, so all of the incoming ssh connections on the server will appear to come from the same IP address. After looking at the docs and doing some digging, I changed the following two parameters in the sshd_config file:

#MaxSessions 10
MaxSessions 500

and

#MaxStartups 10:30:100
MaxStartups 75:10:200

As I understand it, MaxSessions governs the number of active ssh connections to the server (even if they all come from a single IP address), while MaxStartups relates to initial connection attempts (e.g., people trying to log in who haven't yet provided a password). So in this case up to 75 unauthenticated connections would be accepted, after which 10% of new attempts would be dropped, with that rate increasing until all attempts are dropped at 200. (Should I set MaxSessions and this last number to be the same?)
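One way to check whether either limit is actually being hit would be to watch the connection counts on the server during class. A rough sketch (assuming standard RHEL 7 tooling; the pre-auth process names can vary by OpenSSH version):

# established TCP connections to sshd (output includes one header line)
ss -tn state established '( sport = :22 )' | wc -l

# sshd processes still in the pre-authentication phase -- these are what
# count against MaxStartups; they may show up as "sshd: [accepted]" or
# "sshd: [net]" depending on the OpenSSH version
ps ax | grep -c '[s]shd: \['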

I’m using password authentication, and root login is disabled. We generally log in from Windows 10 machines using the Git Bash shell (though I have also tried PuTTY to see if that would make a difference; it didn't).

In any case, am I on the right track here in dealing with the login issue? The problem is that I can’t reliably reproduce this at will. It only seems to occur in class when there are a bunch of connection attempts at the same time; I log in and out without any trouble at other times, and none of the students has reported this problem at other times.

What else can I try to help diagnose and fix this problem? I know this seems to be a type of error that many people run into, and I've read a fair bit here, but I haven't found a working fix yet.


UPDATE

So when I try to reproduce this problem with this small script (credit to @RobbieMckennie for giving me this idea):

#!/bin/bash
# attempt $LIMIT consecutive ssh logins to the server (userid and IP anonymized)
LIMIT=5

for i in $(seq $LIMIT)
do
    echo
    echo "============================= ${i} ==================="
    ssh -vvv userid@xx.xx.xxx.xx
    echo
done
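As an aside, a variant that opens the connections in parallel might mimic the start-of-class burst more closely. This is just a sketch: BatchMode makes each ssh fail at authentication right away instead of hanging at the password prompt, so every attempt still counts as a connection without any interaction.

LIMIT=10
for i in $(seq $LIMIT)
do
    ssh -o BatchMode=yes -o ConnectTimeout=5 userid@xx.xx.xxx.xx exit &
done
wait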

Running the sequential script above, I'll get this after 3 login attempts:

$ ssh -vvv userid@xx.xx.xxx.xx
OpenSSH_7.5p1, OpenSSL 1.0.2k  26 Jan 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug2: resolving "xx.xx.xxx.xx" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to xx.xx.xxx.xx [xx.xx.xxx.xx] port 22.
debug1: connect to address xx.xx.xxx.xx port 22: Connection refused
ssh: connect to host xx.xx.xxx.xx port 22: Connection refused

In fact, I am able to reproduce this "by hand": if I log in 3 times quickly, one after the other, the 4th attempt results in this. The originating IP address is in my ignoreip list in fail2ban (jail.local) and that seems to work as far as I can tell:

2017-10-12 07:38:04,481 fail2ban.filter         [52845]: WARNING Determined IP using DNS Lookup: c-yy-yy-yy-yyy.hsd1.il.comcast.net = ['yy.yy.yy.yyy']
2017-10-12 07:38:04,482 fail2ban.filter         [52845]: INFO    [sshd] Ignore yy.yy.yy.yyy by ip

though I'm not sure if the warning means anything.
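One way to rule fail2ban out completely would be to query the jail directly while the problem is happening (assuming the jail is named sshd, as in the log above); the "Banned IP list:" line in the output should not contain the campus address:

$ sudo fail2ban-client status sshd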

So, two questions:

  1. What is causing this rejection? I don't even get to the system as far as I can tell. Is there a configuration setting I need to tweak?

  2. More importantly, when my 22 students all try to log in from campus, all of their connections originate from the same IP address due to our Many-to-1 NAT. Would that explain this? It seems to me it might(?)

The only thing that is different is that it takes about 15 minutes or so for students to be able to log in again when this rejection happens in class, while in my experiment above I can get back in within a few seconds. Is that maybe due to some sort of backlog?

In particular, I just discovered this entry in iptables:

Chain INPUT_direct (1 references)
target     prot opt source               destination
           tcp  --  anywhere             anywhere             tcp dpt:ssh state NEW recent: SET name: DEFAULT side: source mask: 255.255.255.255
REJECT     tcp  --  anywhere             anywhere             tcp dpt:ssh state NEW recent: UPDATE seconds: 30 hit_count: 4 name: DEFAULT side: source mask: 255.255.255.255 reject-with tcp-reset

This would explain the limit of 3 logins, but again, I'm not sure it would explain the 15-minute or so wait to log back in when we encounter this on campus.
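Since that rule uses the recent match, its state can be inspected and cleared through procfs, which might help test this theory (DEFAULT is the list name shown in the rule; run as root):

# show the tracked source addresses with their hit timestamps
cat /proc/net/xt_recent/DEFAULT

# flush the list -- if this rule is the culprit, logins should work again immediately
echo / > /proc/net/xt_recent/DEFAULT

# or remove the rule itself after looking up its number
iptables -L INPUT_direct --line-numbers
iptables -D INPUT_direct <rule-number>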

Levon
  • I assume you're executing the proper command to restart `sshd`, to load the updated settings? – Robbie Mckennie Oct 09 '17 at 02:07
  • @RobbieMckennie yes, I've restarted sshd .. I guess I will see how things go tomorrow, but since this only happens in class I wanted to see if there's something else I can/need to try or set before class as I really can't reproduce this at will to test it out on my own. – Levon Oct 09 '17 at 02:17
  • Be sure to post back afterwards, I'm interested to know how it goes. – Robbie Mckennie Oct 09 '17 at 02:37
  • @RobbieMckennie nope, didn't work, had the same problem again today :-/ – Levon Oct 09 '17 at 20:31
  • I think in order to make any quick progress you really need to sort out a way to replicate the issue. Perhaps writing a shell script to initiate many ssh sessions may work. I was able to generate network errors on my own machine in this way, but not specifically the connection refused error. – Robbie Mckennie Oct 09 '17 at 23:18
  • The trouble is that there are many possible sources for the error. It could be in the sshd settings, in the server's kernel configuration, the NAT setup you're working with, too many variables. Any suggestion I might have is little more than a stab in the dark. – Robbie Mckennie Oct 09 '17 at 23:20
  • @RobbieMckennie I like your idea of using a script to try to replicate this, that's an excellent suggestion, I may try that. I agree, it's hard to diagnose a problem without being able to reproduce it at will. If I figure out the source of this problem, I will definitely report back here in case someone else is stuck with a similar problem. – Levon Oct 10 '17 at 02:02
  • @RobbieMckennie I did write a script and found out some more info, take a look if you can - thanks. – Levon Oct 12 '17 at 13:54
  • I would try flushing your `iptables` rules, `sudo iptables -F`. If this works it will be a temporary fix, but it will indicate if `iptables` is causing the problem. – Robbie Mckennie Oct 12 '17 at 21:21

2 Answers


I had a similar problem, and it ended up being some nasty bots taking up all of the few connection slots available by default.

Contrary to what one might understand from the man pages, MaxStartups won't drop pending connections in FIFO order when the limit is reached; it will simply ignore all new connections. So if you have the default value of 10 and 10 people connect without sending anything, you won't be able to log in anymore. Basically, you want this number to be very high if you don't want someone to easily lock you out of your system by opening a lot of connections (high DoS potential with the default setting).
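This is easy to demonstrate against a test machine (a sketch using netcat; server.example.com is a placeholder). Each nc process opens a TCP connection to sshd and never authenticates, so it occupies a MaxStartups slot until LoginGraceTime expires:

# open 10 idle, unauthenticated connections to sshd
for i in $(seq 10); do nc server.example.com 22 >/dev/null & done

# with the default MaxStartups of 10, a real login attempt now gets dropped
ssh user@server.example.com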

There is another setting involved: LoginGraceTime. It is the time before an unauthenticated user gets kicked by the server, which is 600 seconds by default. This would explain why the OP sees a delay of around 15 minutes before students can actually log in. You want this setting to be as low as possible so that dummy connections are quickly discarded.
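For example, in /etc/ssh/sshd_config (30 seconds is an arbitrary choice, long enough to type a password; restart sshd afterwards):

# drop unauthenticated connections after 30 seconds instead of the default
LoginGraceTime 30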

I'm probably going to contact the Debian OpenSSH maintainer to discuss this problem. The default configuration shouldn't be this vulnerable to DoS attacks.

EDIT: I took a look at bug reports, CVEs, and the OpenSSH source code. Increasing MaxStartups has a direct, constant impact on RAM usage, since the connection slots are allocated all at once; basically there is something like "malloc(MAX_STARTUPS*sizeof(connection))". Actually fixing this would require a major rework of the way OpenSSH deals with memory allocation, which takes time that nobody has.

JulienCC

I know I'm late ;-) but I guess you have fail2ban running, or something similar?

Fail2ban can help protect all kinds of daemons against brute-force attacks. For sshd, fail2ban temporarily blocks connections from IP addresses that fail to log in repeatedly. There are several approaches to solving this situation: stop fail2ban, whitelist the school's IP, ...
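For the whitelist approach, something like this in /etc/fail2ban/jail.local should work (with xx.xx.xxx.xx standing in for the school's NAT egress address), followed by restarting fail2ban:

[DEFAULT]
# never ban localhost or the campus NAT address
ignoreip = 127.0.0.1/8 xx.xx.xxx.xx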

  • Yup, that was the cause :) I eventually figured it out, but good to leave the solution here for others' benefit. – Levon Jul 06 '20 at 22:17