
I have the following pipeline:

readFile > parseCSV > otherProcess

The readFile step is the standard Node.js fs.createReadStream, while parseCSV is a Node.js transform stream (the csv-parse module).

I want to iterate through a CSV file line by line and handle a single line at a time, so streams and async iterators are a perfect match.

I have the following code, which works properly:

const fs = require('fs');
const parse = require('csv-parse');

async function* readByLine(path, opt) {
  const readFileStream = fs.createReadStream(path);
  const csvParser = parse(opt);
  // Pipe the file stream into the parser and iterate its output.
  const parser = readFileStream.pipe(csvParser);
  for await (const record of parser) {
    yield record;
  }
}

I'm quite new to Node.js streams, but I've read from many sources that the stream.pipeline function is preferred over the .pipe method of readable streams.

How can I change the code above to use stream.pipeline (actually the promise version obtained from util.promisify(pipeline)) while still yielding one line at a time?

2 Answers


Adding to @eol's answer, I would recommend storing the promise and awaiting it after the async iteration.

const fs = require('fs');
const parse = require('csv-parse');
const stream = require('stream');

async function* readByLine(path, opt) {
    const readFileStream = fs.createReadStream(path);
    const csvParser = parse(opt);
    // Start the pipeline, but don't await it yet. (stream.promises requires
    // Node 15+; on older versions use util.promisify(stream.pipeline) as in
    // the answer below.)
    const promise = stream.promises.pipeline(readFileStream, csvParser);
    // Consume the parser's output as it arrives.
    for await (const record of csvParser) {
        yield record;
    }
    // Await afterwards so any pipeline error is surfaced to the caller.
    await promise;
}

If you call await pipeline(...) before the loop, the whole stream is consumed before you start iterating, so you only see whatever is left in the buffer. That works by accident on small streams but is likely to break on larger (or infinite/lazy) streams.

The callback equivalent makes it clearer what happens depending on where we await:

// await before iterating: the callback only fires once the pipeline is done,
// so the stream has already been drained before the loop starts
stream.pipeline(a, b, async err => {
  if (err) return callback(err)

  for await (const record of b) {
    // process record
  }

  callback()
})

// await after iterating: stream.pipeline() returns the last stream,
// so we can consume records as they arrive
for await (const record of stream.pipeline(a, b, callback)) {
  // process record
}
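For completeness, a minimal sketch of how a caller might consume readByLine (the file name and parser options below are illustrative, not from the question):

async function main() {
  // Hypothetical file and options, just to show the call shape.
  for await (const record of readByLine('./data.csv', { delimiter: ',' })) {
    console.log(record); // each record is one parsed CSV line
  }
}

main().catch(console.error);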
– Val
  • I'm reading a csv file and need to turn each record into a postgres row after adding a column and checking it's not a dupe. I prefer this example because it is more concise and the input file size is not a consideration. But, I'm new to Node. How is readByLine() invoked outside of its block to return a single record? Is .next involved? Where are errors handled? I'd like to put it all into async.auto form. Thanks for any help. – viejo Sep 27 '22 at 17:23
  • @viejo the function flow is different than "normal" functions because it's an [async iterator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Iteration_protocols#the_async_iterator_and_async_iterable_protocols) - .next() is called automatically by the `for await of` structure below (see the sketch after this thread) – Val Oct 11 '22 at 10:16
  • Thank you, Val. I copied the code above and called readByLine() but got back only "Object [AsyncGenerator] {}". I'm sure there are missing parts to your example that are obvious to a knowledgeable person. To make a functional program, do I have to place all of the record processing and database checking/adding logic where you have placed the line 'yield record;'? In other words, calling readByLine() multiple times will not produce one record after another sequentially? – viejo Oct 12 '22 at 20:44
  • For anyone as much a novice as myself, I recommend https://medium.com/@segersian/howto-async-generators-in-nodejs-c7f0851f9c02 to explain iteration using function generators. – viejo Oct 15 '22 at 12:00
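To illustrate Val's point in the thread above, here's a short sketch of what `for await ... of` does under the hood (hand-written for this purpose; the arguments are hypothetical):

async function demo() {
  const iterator = readByLine('./data.csv', {}); // hypothetical arguments
  // `for await ... of` repeatedly calls .next() until done is true.
  let result = await iterator.next();
  while (!result.done) {
    console.log(result.value); // one record per call
    result = await iterator.next();
  }
}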

You should actually be able to just pass both the fs-stream and the parser-stream to pipeline() and use your async iterator on the parser-stream:

const fs = require('fs');
const parse = require('csv-parse');
const stream = require('stream')
const util = require('util');
const pipeline = util.promisify(stream.pipeline);

async function* readByLine(path, opt) {
    const readFileStream = fs.createReadStream(path);
    const csvParser = parse(opt);
    await pipeline(readFileStream, csvParser);
    for await (const record of csvParser) {
        yield record;
    }
}
– eol
  • With this I get the following error `[ERR_INVALID_CALLBACK]: Callback must be a function. Received Parser` at this line `await pipeline(readFileStream, csvParser);`. I didn't mention that I'm using TypeScript, but I think it shouldn't change the problem – Alessandro Staffolani Dec 31 '20 at 16:59
  • Weird, worked fine for me with node 14.x.x on debian - can't test it right now, but I'll have another look at it tomorrow. – eol Dec 31 '20 at 17:09
  • If it can help, I'm running on macos with node 12.11 – Alessandro Staffolani Dec 31 '20 at 19:15
  • Also works fine for me on 12.11.0, are you sure you're using the same code as I did? Are the imports the same? – eol Jan 01 '21 at 15:54
  • I'm really sorry, it works with the proper imports. I copied and pasted your code, but in mine I had named the result of `util.promisify(stream.pipeline)` `pipelinePromise`, while `pipeline` was the callback-style stream function. I'll mark it as correct, thanks – Alessandro Staffolani Jan 02 '21 at 18:45
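A hedged reconstruction of that mixup, for anyone hitting the same ERR_INVALID_CALLBACK (the variable names follow the comment above):

const fs = require('fs');
const parse = require('csv-parse');
const stream = require('stream');
const util = require('util');

// The promisified version, stored under a different name as in the comment:
const pipelinePromise = util.promisify(stream.pipeline);

async function* readByLine(path, opt) {
    const readFileStream = fs.createReadStream(path);
    const csvParser = parse(opt);
    // Wrong: `stream.pipeline` is the callback API; without a function as the
    // last argument it throws ERR_INVALID_CALLBACK ("Received Parser").
    // stream.pipeline(readFileStream, csvParser);
    // Right: the promisified version takes only the streams.
    await pipelinePromise(readFileStream, csvParser);
    for await (const record of csvParser) {
        yield record;
    }
}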
    I'm really sorry, It works with the proper imports. I copied and pasted your code, but in mine I called `pipeline = util.promisify(stream.pipeline);` `pipelinePromise`, while `pipeline` was the stream pipeline function. I'll mark it as correct, thanks – Alessandro Staffolani Jan 02 '21 at 18:45