i'm somewhat new to NodeJS streams, and the more i learn about them, the more i believe they're not a particularly simple and stable thing. i'm attempting to read big files with csv / csv-parse (apparently the most popular CSV module for NodeJS) using the piping API, which involves using stream-transform by the same author.
part of what i'm experiencing here is reproducible without actually using the parser, so i've commented out those parts to make the example simpler (for those who prefer JavaScript over CoffeeScript, there's also a JS version):
#-------------------------------------------------------------------------------
fs = require 'fs'
transform_stream = require 'stream-transform'
log = console.log
as_transformer = ( method ) -> transform_stream method, parallel: 11
# _new_csv_parser = require 'csv-parse'
# new_csv_parser = -> _new_csv_parser delimiter: ','

#-------------------------------------------------------------------------------
$count = ( input_stream, title ) ->
  count = 0
  #.............................................................................
  input_stream.on 'end', ->
    log ( title ? 'Count' ) + ':', count
  #.............................................................................
  return as_transformer ( record, handler ) =>
    count += 1
    handler null, record

#-------------------------------------------------------------------------------
read_trips = ( route, handler ) ->
  # parser = new_csv_parser()
  input = fs.createReadStream route
  #.............................................................................
  input.on 'end', ->
    log 'ok: trips'
    return handler null
  input.setMaxListeners 100 # <<<<<<
  #.............................................................................
  # input.pipe parser
  input.pipe $count input, 'trips A'
    .pipe $count input, 'trips B'
    .pipe $count input, 'trips C'
    .pipe $count input, 'trips D'
    # ... and so on ...
    .pipe $count input, 'trips Z'
  #.............................................................................
  return null

route = '/Volumes/Storage/cnd/node_modules/timetable-data/germany-berlin-2014/trips.txt'

read_trips route, ( error ) ->
  throw error if error?
  log 'ok'
the input file contains 204865 lines of GTFS data; i'm not parsing it here, just reading it raw, so i guess what i'm counting with the above code is chunks of data.
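to illustrate that guess, here's a minimal sketch (not part of the code above) that counts both the chunks and the bytes coming off a raw read stream; the number of chunks depends on Node's internal buffering, not on the number of lines:

fs  = require 'fs'
log = console.log

count_chunks = ( route ) ->
  chunks = 0
  bytes  = 0
  input  = fs.createReadStream route
  # without a parser, each 'data' event delivers one Buffer chunk, not one line
  input.on 'data', ( chunk ) ->
    chunks += 1
    bytes  += chunk.length
  input.on 'end', ->
    log "chunks: #{chunks}, bytes: #{bytes}"

count_chunks '/Volumes/Storage/cnd/node_modules/timetable-data/germany-berlin-2014/trips.txt'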
i'm piping the stream from counter to counter and would expect to hit the last counter as often as the first one; however, this is what i get:
trips A: 157
trips B: 157
trips C: 157
...
trips U: 157
trips V: 144
trips W: 112
trips X: 80
trips Y: 48
trips Z: 16
in an earlier setup where i actually did parse the data, i got this:
trips A: 204865
trips B: 204865
trips C: 204865
...
trips T: 204865
trips U: 180224
trips V: 147456
trips W: 114688
trips X: 81920
trips Y: 49152
trips Z: 16384
so it would appear that the stream somehow runs dry along the way.
my suspicion was that the end event of the input stream is not a reliable signal to listen to when trying to decide whether all processing has finished; after all, it is logical to assume that processing can only complete some time after the stream has been fully consumed. so i looked for another event to listen to (i didn't find one) and tried delaying the call to the callback (with setTimeout, process.nextTick and setImmediate), but to no avail.
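for reference, the delayed-callback attempts looked roughly like this (a sketch, not my exact code); none of the variants changed the counts:

input.on 'end', ->
  log 'ok: trips'
  # variant 1: queue the callback for the check phase, after pending I/O callbacks
  setImmediate -> handler null
  # variant 2: run the callback right after the current operation, before I/O events
  # process.nextTick -> handler null
  # variant 3: wait a fixed amount of wall-clock time before calling back
  # setTimeout ( -> handler null ), 1000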
it would be great if someone could point out

- (1) what the crucial differences between setTimeout, process.nextTick and setImmediate are in this context, and
- (2) how to reliably determine whether the last byte has been processed by the last member of the pipe.
Update: i now believe the problem lies with stream-transform, which has an open issue where someone reported a very similar problem with practically identical figures (he has 234841 records and ends up with 16390, i have 204865 and end up with 16384). not proof, but too close to be accidental.
i ditched stream-transform and now use event-stream.map instead; the test then runs OK.
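for completeness, this is roughly what the replacement counter looks like with event-stream's map (a sketch of my current setup, not a verbatim copy):

es  = require 'event-stream'
log = console.log

$count = ( input_stream, title ) ->
  count = 0
  input_stream.on 'end', ->
    log ( title ? 'Count' ) + ':', count
  # es.map passes each record through unchanged while counting it
  return es.map ( record, handler ) ->
    count += 1
    handler null, record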