i'm somewhat new to NodeJS streams, and the more i learn about them, the more i believe they're not a particularly simple and stable thing. i'm attempting to read big files with csv / csv-parse (apparently the most popular CSV module for NodeJS) using the piping API, which involves using stream-transform by the same author.
part of what i'm experiencing here is reproducible without actually using the parser, so i've commented out those parts to make the example simpler (for those who prefer JavaScript over CoffeeScript, there's also a JS version):
#-------------------------------------------------------------------------------
fs = require 'fs'
transform_stream = require 'stream-transform'
log = console.log
as_transformer = ( method ) -> transform_stream method, parallel: 11
# _new_csv_parser = require 'csv-parse'
# new_csv_parser = -> _new_csv_parser delimiter: ','

#-------------------------------------------------------------------------------
$count = ( input_stream, title ) ->
  count = 0
  #.............................................................................
  input_stream.on 'end', ->
    log ( title ? 'Count' ) + ':', count
  #.............................................................................
  return as_transformer ( record, handler ) =>
    count += 1
    handler null, record

#-------------------------------------------------------------------------------
read_trips = ( route, handler ) ->
  # parser = new_csv_parser()
  input = fs.createReadStream route
  #.............................................................................
  input.on 'end', ->
    log 'ok: trips'
    return handler null
  input.setMaxListeners 100 # <<<<<<
  #.............................................................................
  # input.pipe parser
  input.pipe $count input, 'trips A'
    .pipe $count input, 'trips B'
    .pipe $count input, 'trips C'
    .pipe $count input, 'trips D'
    # ... and so on ...
    .pipe $count input, 'trips Z'
  #.............................................................................
  return null

route = '/Volumes/Storage/cnd/node_modules/timetable-data/germany-berlin-2014/trips.txt'

read_trips route, ( error ) ->
  throw error if error?
  log 'ok'
the input file contains 204865 lines of GTFS data; i'm not parsing it here, just reading it raw, so i guess what i'm counting with the above code is chunks of data.
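to illustrate that guess, here's a minimal sketch (not part of the code above) that counts both the chunks and the bytes coming off a raw read stream; the number of chunks depends on Node's internal buffering, not on the number of lines:

fs  = require 'fs'
log = console.log

count_chunks = ( route ) ->
  chunks = 0
  bytes  = 0
  input  = fs.createReadStream route
  # without a parser, each 'data' event delivers one Buffer chunk, not one line
  input.on 'data', ( chunk ) ->
    chunks += 1
    bytes  += chunk.length
  input.on 'end', ->
    log "chunks: #{chunks}, bytes: #{bytes}"

count_chunks '/Volumes/Storage/cnd/node_modules/timetable-data/germany-berlin-2014/trips.txt'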
i'm piping the stream from counter to counter and would expect to hit the last counter as often as the first one; however, this is what i get:
trips A: 157
trips B: 157
trips C: 157
...
trips U: 157
trips V: 144
trips W: 112
trips X: 80
trips Y: 48
trips Z: 16
in an earlier setup where i actually did parse the data, i got this:
trips A: 204865
trips B: 204865
trips C: 204865
...
trips T: 204865
trips U: 180224
trips V: 147456
trips W: 114688
trips X: 81920
trips Y: 49152
trips Z: 16384
so it would appear that the stream somehow runs dry along the way.
my suspicion was that the end event of the input stream is not a reliable signal to listen to when trying to decide whether all processing has finished; after all, it is logical to assume that processing can only complete some time after the stream has been fully consumed. so i looked for another event to listen to (i didn't find one) and tried delaying the call to the callback (with setTimeout, process.nextTick and setImmediate), but to no avail.
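for reference, the delayed-callback attempts looked roughly like this (a sketch, not my exact code); none of the variants changed the counts:

input.on 'end', ->
  log 'ok: trips'
  # variant 1: queue the callback for the check phase, after pending I/O callbacks
  setImmediate -> handler null
  # variant 2: run the callback right after the current operation, before I/O events
  # process.nextTick -> handler null
  # variant 3: wait a fixed amount of wall-clock time before calling back
  # setTimeout ( -> handler null ), 1000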
it would be great if someone could point out

- (1) what the crucial differences between setTimeout, process.nextTick and setImmediate are in this context, and
- (2) how to reliably determine whether the last byte has been processed by the last member of the pipe.
Update: i now believe the problem lies with stream-transform, which has an open issue where someone reported a very similar problem with practically identical figures (he has 234841 records and ends up with 16390, i have 204865 and end up with 16384). not proof, but too close to be accidental.
i ditched stream-transform and now use event-stream.map instead; the test then runs OK.
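for completeness, this is roughly what the replacement counter looks like with event-stream's map (a sketch of my current setup, not a verbatim copy):

es  = require 'event-stream'
log = console.log

$count = ( input_stream, title ) ->
  count = 0
  input_stream.on 'end', ->
    log ( title ? 'Count' ) + ':', count
  # es.map passes each record through unchanged while counting it
  return es.map ( record, handler ) ->
    count += 1
    handler null, record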