We've got a Node.js script that is run once a minute to check the status of our apps. Usually, it works just fine. If the service is up, it exits with 0. If it's down, it exits with 1. All is well.

But every once in a while, it just stops. The console reports "Calling status API..." and stays there indefinitely. It doesn't even time out at Node's built-in two-minute timeout. No errors, nothing. It just sits there, waiting, forever. This is a problem because it blocks subsequent status check jobs from running.

At this point, my whole team has looked at it and none of us can figure out what circumstance could make it hang. We've built in a start-to-finish timeout, so that we can move on to the next job, but that essentially skips a status check and creates blind spots. So, I open the question to you fine folks.

Here's the script (with names/urls removed):

#!/usr/bin/env node

// SETTINGS: -------------------------------------------------------------------------------------------------
/** URL to contact for status information. */
const STATUS_API = process.env.STATUS_API;

/** Number of attempts to make before reporting as a failure. */
const ATTEMPT_LIMIT = 3;

/** Amount of time to wait before starting another attempt, in milliseconds. */
const ATTEMPT_DELAY = 5000;

// RUNTIME: --------------------------------------------------------------------------------------------------
const URL = require('url');
const https = require('https');

// Make the first attempt.
make_attempt(1, STATUS_API);

// FUNCTIONS: ------------------------------------------------------------------------------------------------
function make_attempt(attempt_number, url) {
    console.log('\n\nCONNECTION ATTEMPT:', attempt_number);
    check_status(url, function (success) {
        console.log('\nAttempt', success ? 'PASSED' : 'FAILED');

        // If this attempt succeeded, report success.
        if (success) {
            console.log('\nSTATUS CHECK PASSED after', attempt_number, 'attempt(s).');
            process.exit(0);
        }

        // Otherwise, if we have additional attempts, try again.
        else if (attempt_number < ATTEMPT_LIMIT) {
            setTimeout(make_attempt.bind(null, attempt_number + 1, url), ATTEMPT_DELAY);
        }

        // Otherwise, we're out of attempts. Report failure.
        else {
            console.log("\nSTATUS CHECK FAILED");
            process.exit(1);
        }
    })
}

function check_status(url, callback) {
    var handle_error = function (error) {
        console.log("\tFailed.\n");
        console.log('\t' + error.toString().replace(/\n\r?/g, '\n\t'));
        callback(false);
    };

    console.log("\tCalling status API...");
    try {
        var options = URL.parse(url);
        options.timeout = 20000;
        https.get(options, function (response) {
            var body = '';
            response.setEncoding('utf8');
            response.on('data', function (data) {body += data;});
            response.on('end', function () {
                console.log("\tConnected.\n");
                try {
                    var parsed = JSON.parse(body);
                    if ((!parsed.started || !parsed.uptime)) {
                        console.log('\tReceived unexpected JSON response:');
                        console.log('\t\t' + JSON.stringify(parsed, null, 1).replace(/\n\r?/g, '\n\t\t'));
                        callback(false);
                    }
                    else {
                        console.log('\tReceived status details from API:');
                        console.log('\t\tServer started:', parsed.started);
                        console.log('\t\tServer uptime:', parsed.uptime);
                        callback(true);
                    }
                }
                catch (error) {
                    console.log('\tReceived unexpected non-JSON response:');
                    console.log('\t\t' + body.trim().replace(/\n\r?/g, '\n\t\t'));
                    callback(false);
                }
            });
        }).on('error', handle_error);
    }
    catch (error) {
        handle_error(error);
    }
}

If any of you can see any places where this could possibly hang without output or timeout, that'd be very helpful!

Thank you, James Tanner

EDIT: p.s. We use https directly instead of the request package, so that we don't need to install anything when the script runs. This is because the script can run on any build machine assigned to Jenkins without a custom installation.

James Tanner
  • I would check the status code in your response callback; if it's not equal to 200, raise an error. – Keith Sep 18 '17 at 14:15
  • Oh, sorry @Keith, I don't think I was clear on that. Success is determined by the response. A 200 code isn't necessarily sufficient. – James Tanner Sep 18 '17 at 14:22
  • Edited my comment. I'd hit "Add" before I finished typing. – James Tanner Sep 18 '17 at 14:25
  • I'm not saying 200 is sufficient. You still need to check the status code of the response, because you might get a `503 Service Unavailable` or something else. So you still get a response, but you will not receive any `data` or `end` events, so it will hang, because you will never end up calling your callback. – Keith Sep 18 '17 at 14:27
  • Ooh, I see. I thought we handle that with `.on('error', handle_error);` Is there a different/better way? And can you give an example? **Edit:** I think I found what you're talking about: https://stackoverflow.com/questions/23712392/http-get-nodejs-how-to-get-error-status-code I'll give this a try and see if it resolves the issue. – James Tanner Sep 18 '17 at 14:32
  • You're not that far off, to be honest. I'll post a small snippet with the extra check. Oh, I just noticed you found a link. Yes, implement the extra check and you should be good to go. – Keith Sep 18 '17 at 14:37
  • Feel free to add an answer if you'd like the points. I'll do some testing to see if that solves the problem, but it looks right to me. – James Tanner Sep 18 '17 at 14:43
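
One way to check the scenario raised in the comments above is to point the same kind of `get` call at a throwaway local server that always answers 503. This is only a sketch: the server, the helper name `probe`, and the use of plain `http` instead of `https` (to avoid certificates) are assumptions made for illustration. It records which events the client sees.

```javascript
const http = require('http');

// Hypothetical stand-in for the status API: always answers 503 with a small body.
const server = http.createServer(function (req, res) {
    res.writeHead(503, { 'Content-Type': 'text/plain' });
    res.end('Service Unavailable');
});

// Issues one GET and reports, in order, which events were observed.
function probe(port, callback) {
    var seen = [];
    var options = { host: '127.0.0.1', port: port, path: '/', agent: false };
    http.get(options, function (response) {
        seen.push('response ' + response.statusCode);
        response.setEncoding('utf8');
        response.on('data', function () {
            // Record 'data' once, however many chunks arrive.
            if (seen.indexOf('data') === -1) { seen.push('data'); }
        });
        response.on('end', function () {
            seen.push('end');
            callback(seen);
        });
    }).on('error', function () {
        seen.push('error');
        callback(seen);
    });
}

server.listen(0, '127.0.0.1', function () {
    probe(server.address().port, function (seen) {
        console.log(seen.join(' -> '));
        server.close();
    });
});
```

If the handler attaches `data` and `end` listeners as the script in the question does, a non-200 response with a body should still reach `end`, so checking the status code is good practice but may not be what stops the hang.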

2 Answers

Aren't you missing the .end()?

http.request(options, callback).end()

Something like what is explained here.
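
For reference, here is a minimal sketch of the difference this answer is pointing at, run against a throwaway local server. The server and the helper names `fetchWithGet` and `fetchWithRequest` are assumptions made for illustration, and plain `http` stands in for `https`. As the Node.js documentation describes it, `get()` calls `end()` on the request automatically, whereas `request()` leaves that to the caller.

```javascript
const http = require('http');

// Throwaway local server so the sketch needs no external URL.
const server = http.createServer(function (req, res) {
    res.end('ok');
});

function fetchWithGet(port, callback) {
    var options = { host: '127.0.0.1', port: port, path: '/', agent: false };
    // get() sets the method to GET and calls req.end() for us.
    http.get(options, function (response) {
        response.resume(); // discard the body
        response.on('end', function () { callback(null, response.statusCode); });
    }).on('error', callback);
}

function fetchWithRequest(port, callback) {
    var options = { host: '127.0.0.1', port: port, path: '/', agent: false };
    // request() only sends the request once .end() is called.
    var req = http.request(options, function (response) {
        response.resume(); // discard the body
        response.on('end', function () { callback(null, response.statusCode); });
    });
    req.on('error', callback);
    req.end();
}

server.listen(0, '127.0.0.1', function () {
    var port = server.address().port;
    fetchWithGet(port, function (getError, getStatus) {
        fetchWithRequest(port, function (requestError, requestStatus) {
            console.log('get:', getError || getStatus, 'request:', requestError || requestStatus);
            server.close();
        });
    });
});
```

Because the script in the question uses `https.get`, a missing `.end()` would probably not explain the hang by itself; the explicit call matters when switching to `request()`.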

Marek

Inside your response callback, you're not checking the status.

The .on('error', handle_error); handler is for errors that occur while connecting to the server; status code errors are those that the server responds with after a successful connection.

Normally, a 200 status response is what you would expect from a successful request.

So a small modification to your https.get to handle this should do it, e.g.:

https.get(options, function (response) {
  if (response.statusCode !== 200) {
    console.log('\tHTTP statusCode not 200:', response.statusCode);
    response.resume(); // discard the body so the socket can be released
    callback(false);
    return; // no point going any further
  }
  ....
Keith
  • Unfortunately, this does not appear to be the solution. I added this in, and it still hung periodically overnight. I've added some additional logging in to try and identify where, exactly, it gets to. I'll update my post with more details when I get them. – James Tanner Sep 19 '17 at 11:59
  • Oh, another idea: maybe the error occurs not while getting the connection but during it. Try adding a `response.on('error', handle_error);` – Keith Sep 19 '17 at 12:12
  • Trying this now! Just have to wait for it to a) error, or b) hang. Which is basically at random and seems to happen overnight. – James Tanner Sep 19 '17 at 12:51
  • No luck. In fact, I added a console log in the callback and apparently it's never even called. – James Tanner Sep 19 '17 at 16:05
  • I'm running out of ideas :(, but one idea is to do what the npm request package does and create your own timeout. Also, I'm assuming you're running a Node version greater than or equal to v6.8.0, as that's when the timeout option was added. – Keith Sep 19 '17 at 16:22
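
The "create your own timeout" idea could look roughly like the sketch below. It relies on one detail from the Node.js documentation: the `timeout` option (like `setTimeout()` on the request) only causes a `'timeout'` event to be emitted, and the request keeps waiting unless the code destroys it. The helper name `getWithDeadline`, its parameters, the millisecond values, and the silent local server are all hypothetical. Plain `http` is used so the sketch runs without certificates; `https.get` returns the same kind of request object.

```javascript
const http = require('http');

/**
 * GET with two safety nets (names and values are illustrative assumptions):
 *   idleMs     - socket inactivity limit; 'timeout' only notifies, so we destroy the request ourselves.
 *   deadlineMs - hard wall-clock limit for the whole attempt.
 * Calls callback(error, statusCode, body) exactly once.
 */
function getWithDeadline(options, idleMs, deadlineMs, callback) {
    var finished = false;
    var deadline = null;

    function finish(error, statusCode, body) {
        if (finished) { return; }
        finished = true;
        clearTimeout(deadline);
        callback(error, statusCode, body);
    }

    var req = http.get(options, function (response) {
        var body = '';
        response.setEncoding('utf8');
        response.on('data', function (chunk) { body += chunk; });
        response.on('end', function () { finish(null, response.statusCode, body); });
        response.on('error', finish);
    });

    // Later errors (including the one caused by destroy) are ignored once finished.
    req.on('error', finish);

    req.setTimeout(idleMs, function () {
        finish(new Error('socket idle for ' + idleMs + ' ms'));
        req.destroy();
    });

    deadline = setTimeout(function () {
        finish(new Error('no result after ' + deadlineMs + ' ms'));
        req.destroy();
    }, deadlineMs);
}

// Demo: a local server that accepts the request but never answers.
const silentServer = http.createServer(function () { /* never respond */ });
silentServer.listen(0, '127.0.0.1', function () {
    var options = { host: '127.0.0.1', port: silentServer.address().port, path: '/', agent: false };
    getWithDeadline(options, 200, 2000, function (error, statusCode) {
        console.log(error ? 'gave up: ' + error.message : 'status: ' + statusCode);
        silentServer.close();
    });
});
```

In the script from the question, check_status could call a helper like this in place of the bare https.get and report callback(false) on error, so that a stalled attempt counts as a failed attempt and the retry logic carries on.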