
I am writing a small parser to process some log files using node streams (io.js, actually, but I don't think that matters).

I am following the example in the docs for unshift to parse out the header. I can successfully split the buffer and grab the header, but as soon as I call stream.unshift it seems to concatenate the header string and the remaining string.

In setting up some example code for this question, I found that this behavior only occurs with a file based stream. With a string based stream the issue does not happen, even though the file has the exact same text as the string.

Here is what the file looks like in my text editor with whitespace chars turned on (for comparison): [screenshot: text editor output of the example log file]

I need some assistance understanding why this is happening.

var StringDecoder = require('string_decoder').StringDecoder;

// setup string based stream in fake_stream
    var Stream = require('stream');
    var fake_file = 'FILE_TYPE:SOME-HEADER-DATE\r\n'
                  + 'HEADER_END\r\n'
                  + '1234|logged data|1|2|3|4|5|some other logged data\x1E\r\n'
                  + '1235|logged data|1|2|3|4|5|some other logged data\x1E\r\n'
                  + '1236|logged data|1|2|3|4|5|some other logged data\x1E\r\n';
    var fake_stream = new Stream.Readable();
    fake_stream.push(new Buffer(fake_file, 'utf8'));
    fake_stream.push(null);


// setup file based stream in file_stream
// the file minimal_test_log.glf has the text shown above (with the control characters unescaped)
    var fs = require('fs');
    var file = 'C:\\Some\\Path\\To\\minimal_test_log.glf';
    var file_stream = fs.createReadStream(file);



// WHY AM I GETTING DIFFERENT RESULTS HERE?

    parseHeader(file_stream, function(err, header, stream) {
        console.log('processing file_stream: ' + header.length);
        // RESULTS:  processing file_stream: 184
        // this results in both parts being concatenated, without the HEADER_END\r\n
    });

    parseHeader(fake_stream, function(err, header, stream) {
        console.log('processing fake_stream: ' + header.length);
        // RESULTS:  processing fake_stream: 28
        // these results are what I would expect: everything before HEADER_END
    });



// Slightly modified example found at https://iojs.org/api/stream.html#stream_readable_unshift_chunk

function parseHeader(stream, callback) {
    stream.on('error', callback);
    stream.on('readable', onReadable);

    var decoder = new StringDecoder('utf8');
    var header = '';

    function onReadable() {

        var chunk, buf, remaining;
        var header_boundary = /HEADER_END\r\n/g;

        while (null !== (chunk = stream.read())) {

            var str = decoder.write(chunk);

            if (str.match(header_boundary)) {

                var split = str.split(header_boundary);
                header += split.shift();

                remaining = split.join('');
                buf = new Buffer(remaining, 'utf8');

                if (buf.length) {
                    stream.unshift(buf);
                }

                // the header length is different starting at this point

                stream.removeListener('error', callback);
                stream.removeListener('readable', onReadable);

                callback(null, header, stream);

            } else {
                header += str;
            }
        }
    }
}
Hari Seldon

1 Answer


So adding a counter to onReadable shows that it is being called twice. Since header is declared in a scope broader than onReadable, it retains whatever was stored in it by the first call. The second time through onReadable, header_boundary does not match, so the if statement falls through to the else clause and appends the rest of the log to header.
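
For reference, this is roughly how I added the counter (the variable name is mine). It lives in the same scope as header, so it survives across the repeated calls:

    var decoder = new StringDecoder('utf8');
    var header = '';
    var readable_count = 0;    // diagnostic counter, shares scope with header

    function onReadable() {
        readable_count++;
        console.log('onReadable call #' + readable_count);
        // ...rest of onReadable unchanged...
    }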

I reread the docs on the readable event, and learned that

Once the internal buffer is drained, a readable event will fire again when more data is available

I believe that is what is happening when I call stream.unshift(buf): a second readable event is fired whenever I add data back into the stream with unshift.
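
A tiny standalone check (my own snippet, not from the docs) makes this easy to see: push some data into a readable stream, unshift part of it back, and count the readable events.

    var Stream = require('stream');

    var s = new Stream.Readable();
    s.push(new Buffer('HEADERbody', 'utf8'));
    s.push(null);

    var calls = 0;
    s.on('readable', function () {
        calls++;
        console.log('readable event #' + calls);
        var chunk = s.read();
        if (chunk && calls === 1) {
            // put part of the data back; this causes readable to fire again
            s.unshift(new Buffer('body', 'utf8'));
        }
    });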

The second readable event is fired for both the "file" based and the "string" based streams, but the timing is slightly different between them. With the "string" based stream, by the time the second readable event fires, the callback has already executed. With the "file" based stream, the callback does not execute until after the second readable event has fired and the extra data has already been appended to the header variable.

I've not quite worked out why the timing is different between the streams, but it gives me enough to go on to continue my work.
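
For what it's worth, here is the variation I am going to try next (just a sketch, not yet verified against the real log files): detach the listeners before calling unshift, so the readable event that unshift triggers can no longer land in onReadable, and bail out of the read loop once the callback has fired so the loop does not consume the unshifted body.

var StringDecoder = require('string_decoder').StringDecoder;

function parseHeader(stream, callback) {
    stream.on('error', callback);
    stream.on('readable', onReadable);

    var decoder = new StringDecoder('utf8');
    var header = '';

    function onReadable() {

        var chunk, buf, remaining;
        var header_boundary = /HEADER_END\r\n/g;

        while (null !== (chunk = stream.read())) {

            var str = decoder.write(chunk);

            if (str.match(header_boundary)) {

                var split = str.split(header_boundary);
                header += split.shift();

                remaining = split.join('');
                buf = new Buffer(remaining, 'utf8');

                // detach first: the readable event fired by unshift()
                // can no longer re-enter onReadable and touch header
                stream.removeListener('error', callback);
                stream.removeListener('readable', onReadable);

                if (buf.length) {
                    stream.unshift(buf);
                }

                callback(null, header, stream);
                return; // leave the unshifted body in the stream for the caller

            } else {
                header += str;
            }
        }
    }
}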

Hari Seldon