I'm currently developing a remote job scheduler in Perl. It has to connect via SSH to x servers and execute predefined jobs or job groups.
I use Net::SSH2, which is built on libssh2.
My program usually works fine with around 400 to 500 servers, but when I try to run a basic uptime
command on 1000 servers, one or more of my threads hangs and either never finishes or finishes about 30 minutes later.
The behavior is random: sometimes it finishes on time, sometimes not.
I tracked the problem down to this Net::SSH2 call: $in .= $buf while $chan->read( $buf, 10240 );
Here is the full connection code:
my $chan = $this->{netssh2}->channel() or die $!;
$chan->blocking(1);
$chan->exec($command);
my ($in,$err,$buf,$buf_err);
$in .= $buf while $chan->read( $buf, 10240 );
$err .= $buf_err while $chan->read( $buf_err, 10240, 1 );
$chan->send_eof;
1 while !$chan->eof;
$chan->wait_closed;
I then downloaded the Net::SSH2 source package and added tracing to the C-to-Perl glue (XS) file.
That showed me that the problem comes from this line:
count = libssh2_channel_read_ex(ch->channel, XLATEXT, pv_buffer, size);
This function is part of the libssh2 library: http://www.libssh2.org/libssh2_channel_read_ex.html
Sometimes (about 1 in 1000 times) the program enters this read and never returns. The affected servers are different most of the time.
Do you have any idea what I should be looking for or checking? I've been working on this for a few days, and I'd very much like some outside advice :)
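
One possible way to stop a single stuck read from hanging a whole thread is to put the channel in non-blocking mode and drain stdout and stderr in one loop under a deadline. The following is only a sketch under assumptions, not a confirmed fix for the hang: read_with_deadline and its parameters are hypothetical names that are not part of Net::SSH2, and it assumes that with blocking(0) set, read() returns undef or 0 instead of waiting when no data is available.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper (not part of Net::SSH2): drain stdout and stderr of a
# channel-like object until EOF, or give up after $timeout seconds.
# Assumes the channel is in non-blocking mode, so read() returns undef or 0
# rather than blocking when nothing is ready.
sub read_with_deadline {
    my ($chan, $timeout) = @_;
    my ($out, $err) = ('', '');
    my $deadline = time() + $timeout;

    while (1) {
        my $got = 0;
        for my $ext (0, 1) {              # 0 = stdout, 1 = stderr
            my $buf;
            my $n = $chan->read($buf, 10240, $ext);
            next unless defined $n && $n > 0;
            if ($ext) { $err .= $buf } else { $out .= $buf }
            $got = 1;
        }
        next if $got;                                    # more data may be waiting
        return ($out, $err, 1) if $chan->eof;            # remote side finished
        return ($out, $err, 0) if time() >= $deadline;   # gave up: timed out
        sleep 0.05;                                      # avoid a busy loop
    }
}
```

With a real channel, this would be called after $chan->blocking(0) and $chan->exec($command), for example my ($in, $err, $done) = read_with_deadline($chan, 30); a false $done marks a host that timed out and can be logged or retried instead of stalling the thread. Reading both streams in the same loop also avoids waiting on stdout while the remote side is writing to stderr.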