
I have users.json (assume it will be a large file); I want to stream-read this file, but limit the chunk size.

[
  {
    "name": "John Doe",
    "occupation": "gardener",
    "born": "1992-03-02"
  },
  {
    "name": "Brian Flemming",
    "occupation": "teacher",
    "born": "1967-11-22"
  },
  {
    "name": "Lucy Black",
    "occupation": "accountant",
    "born": "1995-04-07"
  },
  {
    "name": "William Bean",
    "occupation": "pilot",
    "born": "1977-10-31"
  }
]

My sample code.

const fs = require('fs');
const stream = require('stream');

async function logChunks(readable) {
  for await (const chunk of readable) {
    console.log('---------start');
    console.log(chunk.toString());
    console.log('---------end');
  }
}

const readStream = fs.createReadStream('users.json', { highWaterMark: 120 });
logChunks(readStream);

The output looks like this:

---------start
[
  {
    "name": "John Doe",
    "occupation": "gardener",
    "born": "1992-03-02"
  }
  ,
  {
    "name": "Brian Flem
---------end
---------start
ming",
    "occupation": "teacher",
    "born": "1967-11-22"
  }
  ,
  {
    "name": "Lucy Black",
    "occupation": "ac
---------end
---------start
countant",
    "born": "1995-04-07"
  }
  ,
  {
    "name": "William Bean",
    "occupation": "pilot",
    "born": "1977
---------end
---------start
-10-31"
  }
]

---------end

My goal is to extract the JSON objects from the multiple chunks, so that each one can be passed to JSON.parse().

I couldn't find a JSON stream parser for Node.js, so I hope I can get some expert ideas here. Thanks.


Update:

One option I found is to use a third-party solution, stream-json:

const util = require('util');
const stream = require('stream');
const StreamArray = require('stream-json/streamers/StreamArray');

// readStream is the fs.createReadStream('users.json', ...) defined above,
// and this snippet runs inside an async function:
await util.promisify(stream.pipeline)(
  readStream,
  StreamArray.withParser(),
  async function (parsedArrayEntriesIterable) {
    for await (const { key: arrIndex, value: arrElem } of parsedArrayEntriesIterable) {
      console.log("Parsed array element:", arrElem);
    }
  }
);
icelemon
  • Reposting this comment since I commented on the wrong question. Is there any particular reason not to save all the chunks in a buffer and parse the entire JSON string at the end (a sketch of that buffering approach follows after these comments)? I can show you that answer easily; otherwise, we have to write a custom parser that splits each incomplete JSON string into the valid part and the incomplete part. Waiting for the whole JSON string is not that bad an idea, since the user does not get blocked during the reading: the main thread of the JavaScript event loop gets control on every iteration of the loop, since each iteration is asynchronous. – waterloos Aug 12 '21 at 16:58
  • I'm also interested in the solution. I currently have no use case, but I'm curious how that parser would work (and how to extend it to work with arrays/nested objects). – Marc Aug 13 '21 at 10:38
  • @Summer Thanks to your update, I realized there is a better solution with the library you posted. You can also use this library for your other question https://stackoverflow.com/questions/68705813/fail-to-parse-on-a-json-stream-using-node-fetch. I will update that answer too when I have time. – waterloos Aug 15 '21 at 08:54
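
For reference, the buffer-everything-then-parse approach suggested in the first comment above would look roughly like the sketch below. It keeps the whole file in memory, which is exactly what the streaming approach in the question is trying to avoid.

const fs = require('fs');

// Sketch of the buffering alternative: collect every chunk and parse once
// the stream ends. Simple, but the entire file ends up in memory.
function readWholeJson(path) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    fs.createReadStream(path, { highWaterMark: 120 })
      .on('data', chunk => chunks.push(chunk))
      .on('end', () => {
        try {
          resolve(JSON.parse(Buffer.concat(chunks).toString()));
        } catch (err) {
          reject(err);
        }
      })
      .on('error', reject);
  });
}

readWholeJson('users.json').then(users => console.log(users));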

1 Answer


I read the update on your question and realized that the comment I left was totally off the point. Since you are using streams, you don't want to wait for all the data to arrive, in order to avoid memory exhaustion. I should have noticed that at the beginning.

Let me give you some examples as my apology. I hope this helps you understand how to use streams.

To make the samples more realistic, let's simulate fetching JSON from a remote server like node-fetch does. node-fetch returns an instance of ReadableStream that is also async iterable. We can create one easily by passing an asynchronous generator function to stream.Readable.from() as below.

Definition of fetch()

async function* asyncGenerator (chunks) {
  let counter = 1;
  for (const chunk of chunks) {
    await new Promise(resolve => setTimeout(resolve, 1000));
    console.log(`==== chunk ${counter++} transmitted =====================`);
    yield chunk;
  }
}

const stream = require('stream');

// simulates node-fetch
async function fetch (json) {
  const asyncIterable = asyncGenerator(json);
  // let the client wait for 0.5 sec.
  await new Promise(resolve => setTimeout(resolve, 500));
  return new Promise(resolve => {
    // returns the response object
    resolve({ body: stream.Readable.from(asyncIterable) });
  });
}

fetch() takes 0.5 sec to fetch the response object. It returns a Promise that resolves to an object whose body provides the ReadableStream. That readable stream keeps sending a chunk of JSON data downstream every second, as defined in asyncGenerator().

Our fetch() function takes an array of JSON chunks as a parameter instead of a URL. Let's use the one you provided, but split it at slightly different points, so that after receiving the second chunk we get two complete objects.

const chunkedJson = [
  // chunk 1
  `[
  {
    "name": "John Doe",
    "occupation": "gardener",
    "born": "1992-03-02"
  }
  ,
  {
    "name": "Brian Flem`,
  // chunk 2
  `ming",
    "occupation": "teacher",
    "born": "1967-11-22"
  }
  ,
  {
    "name": "Lucy Black",
    "occupation": "accountant",
    "born": "1995-04-07"
  }`,
  // chunk 3
  `,
  {
    "name": "William Bean",
    "occupation": "pilot",
    "born": "1977`,
  // chunk 4
  `-10-31"
  }
]`
];

Now, with this data, you can confirm how fetch() works as follows.

Example 1: Testing fetch()

async function example1 () {
  const response = await fetch(chunkedJson);
  for await (const chunk of response.body) {
    console.log(chunk);
  }
}

example1();
console.log("==== Example 1 Started ==============");

The Output of Example 1.

==== Example 1 Started ==============
==== chunk 1 transmitted =====================
[
  {
    "name": "John Doe",
    "occupation": "gardener",
    "born": "1992-03-02"
  }
  ,
  {
    "name": "Brian Flem
==== chunk 2 transmitted =====================
ming",
    "occupation": "teacher",
    "born": "1967-11-22"
  }
  ,
  {
    "name": "Lucy Black",
    "occupation": "accountant",
    "born": "1995-04-07"
  }
==== chunk 3 transmitted =====================
,
  {
    "name": "William Bean",
    "occupation": "pilot",
    "born": "1977
==== chunk 4 transmitted =====================
-10-31"
  }
]

Now, let's handle each element of this JSON data without waiting for all of it to arrive.

StreamArray is a subclass of stream.Transform, so it has the interface of both a ReadableStream and a WritableStream. If stream instances are connected with pipe(), you don't have to worry about backpressure, so we pipe the two streams, i.e. the ReadableStream obtained from fetch() and the instance of StreamArray, together as response.body.pipe(StreamArray.withParser()) in Example 2 below.

pipe(StreamArray.withParser()) returns the instance of StreamArray itself for method chaining, so the pipeline variable now holds a reference to the transform stream, which is also a readable stream. We can attach an event listener to it in order to consume the transformed data.

StreamArray emits a data event whenever a single object has been parsed from the readable source, so pipeline.on('data', callback) handles it chunk by chunk without waiting for the whole JSON data.

When the event listener is registered for the data event with pipeline.on('data', callback), the stream starts to flow.

Since we simulate the data fetching asynchronously, you can see !!!! MAIN THREAD !!!! printed in the console in the middle of the data transmission. This confirms that the main thread is not blocked while waiting for the parsed data.

Example 2: Testing stream-json, processing each array element one by one as it arrives

const StreamArray = require('stream-json/streamers/StreamArray');

async function example2 () {
  const response = await fetch(chunkedJson);
  const pipeline = response.body.pipe(StreamArray.withParser());
  const timer = setInterval(() => console.log("!!!! MAIN THREAD !!!!"), 500);
  pipeline.on('data', ({ key, value }) => {
    console.log("====== stream-json StreamArray() RESULT ========");
    console.log(value); // do your data processing here
  }).on('close', () => {
    clearInterval(timer); // stop the main thread console.log
  });
}

example2();
console.log("==== Example 2 Started ==============");

The Output of Example 2.

==== Example 2 Started ==============
!!!! MAIN THREAD !!!!
==== chunk 1 transmitted =====================
====== stream-json StreamArray() RESULT ========
{ name: 'John Doe', occupation: 'gardener', born: '1992-03-02' }
!!!! MAIN THREAD !!!!
!!!! MAIN THREAD !!!!
==== chunk 2 transmitted =====================
====== stream-json StreamArray() RESULT ========
{ name: 'Brian Flemming', occupation: 'teacher', born: '1967-11-22' }
====== stream-json StreamArray() RESULT ========
{ name: 'Lucy Black', occupation: 'accountant', born: '1995-04-07' }
!!!! MAIN THREAD !!!!
!!!! MAIN THREAD !!!!
==== chunk 3 transmitted =====================
!!!! MAIN THREAD !!!!
!!!! MAIN THREAD !!!!
==== chunk 4 transmitted =====================
====== stream-json StreamArray() RESULT ========
{ name: 'William Bean', occupation: 'pilot', born: '1977-10-31' }

Since all streams are instances of EventEmitter, you can simply attach a callback to the data event to consume the final data, as in Example 2. However, it is preferable to use pipe() even for the final data consumption, since pipe() handles backpressure.

A backpressure problem occurs when the downstream's data consumption is slower than the upstream's data feed. For example, when your data handling takes time, you might want to handle each chunk asynchronously. If handling the next chunk finishes before the previous one, the next chunk gets pushed downstream before the first. If the downstream depends on the first chunk before handling the next one, this causes trouble.

When you use an event listener, you have to manually control pause and resume to avoid backpressure (see this as an example). However, if you connect the streams with pipe(), the backpressure problem is taken care of internally. That means that when the downstream is slower than the upstream, pipe() will automatically pause feeding the downstream.
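
The manual pause/resume pattern would look roughly like the sketch below; processSlowly() here is just a hypothetical slow asynchronous handler, not one of the examples in this answer.

pipeline.on('data', async ({ key, value }) => {
  pipeline.pause();             // stop the upstream from pushing more data
  await processSlowly(value);   // hypothetical slow asynchronous processing
  pipeline.resume();            // let the data flow again
});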

So let's create our own WritableStream in order to connect to StreamArray with pipe(). In our case we receive parsed objects from the upstream (i.e. StreamArray) rather than strings or buffers, so we have to set objectMode to true. We override the _write() function, which will internally be called from write(). You put all the data handling logic there and call callback() upon finishing. When streams are connected with pipe(), the upstream does not feed the next piece of data until the callback is called.

In order to simulate backpressure, below we process the 1st and 3rd objects for 1.5 seconds and the 2nd and 4th objects for zero seconds.

Example 3: Piping Our Own Stream Instance

class MyObjectConsumerStream extends stream.Writable {
  constructor(options) {
    super({ ...options, objectMode: true });
  }

  _write(chunk, encoding, callback) {
    const { key, value } = chunk; // receive from StreamArray of stream-json
    console.log("===== started to processing the chunk ........... ");
    setTimeout(() => {
      console.log("====== Example 3 RESULT ========");
      console.log(value); // do your data processing here
      callback(); // pipe() will pause the upstream until callback is called
    }, key % 2 === 0 ? 1500 : 0); // the 2nd and 4th objects take 0 sec, the 1st and 3rd take 1.5 sec
  }
}

//--- Example 3: We write our own WritableStream to consume chunked data ------
async function example3 () {
  const response = await fetch(chunkedJson);
  const timer = setInterval(() => console.log("!!!! MAIN THREAD !!!!"), 500);
  response.body
    .pipe(StreamArray.withParser())
    .pipe(new MyObjectConsumerStream())
    .on('finish', () => clearInterval(timer)); // stop the main thread console.log
}

example3();
console.log("==== Example 3 Started ==============");

The Output of Example 3.

==== Example 3 Started ==============
!!!! MAIN THREAD !!!!
==== chunk 1 transmitted =====================
===== started to processing the chunk ........... 
!!!! MAIN THREAD !!!!
!!!! MAIN THREAD !!!!
==== chunk 2 transmitted =====================
!!!! MAIN THREAD !!!!
====== Example 3 RESULT ========
{ name: 'John Doe', occupation: 'gardener', born: '1992-03-02' }
===== started to processing the chunk ........... 
!!!! MAIN THREAD !!!!
====== Example 3 RESULT ========
{ name: 'Brian Flemming', occupation: 'teacher', born: '1967-11-22' }
===== started to processing the chunk ........... 
==== chunk 3 transmitted =====================
!!!! MAIN THREAD !!!!
!!!! MAIN THREAD !!!!
====== Example 3 RESULT ========
{ name: 'Lucy Black', occupation: 'accountant', born: '1995-04-07' }
==== chunk 4 transmitted =====================
===== started to processing the chunk ........... 
!!!! MAIN THREAD !!!!
====== Example 3 RESULT ========
{ name: 'William Bean', occupation: 'pilot', born: '1977-10-31' }

You can confirm that the received data is in order. You can also see that the second chunk's transmission starts while the first object is being processed, since we set it to take 1.5 sec. Now, let's do the same thing using the event listener as follows.

Example 4: Backpressure Problem with Simple Callback

async function example4 () {
  const response = await fetch(chunkedJson);
  const pipeline = response.body.pipe(StreamArray.withParser());
  const timer = setInterval(() => console.log("!!!! MAIN THREAD !!!!"), 500);
  pipeline.on('data', ({ key, value }) => {
    console.log("===== started to processing the chunk ........... ");
    setTimeout(() => {
      console.log(`====== Example 4 RESULT ========`);
      console.log(value); // do your data processing here
    }, key % 2 === 0 ? 1500 : 0); // the 2nd and 4th objects take 0 sec, the 1st and 3rd take 1.5 sec
  }).on('close', () => {
    clearInterval(timer); // stop the main thread console.log
  });
}

example4();
console.log("==== Example 4 Started ==============");

The Output of Example 4.

==== Example 4 Started ==============
!!!! MAIN THREAD !!!!
==== chunk 1 transmitted =====================
===== started to processing the chunk ........... 
!!!! MAIN THREAD !!!!
!!!! MAIN THREAD !!!!
==== chunk 2 transmitted =====================
===== started to processing the chunk ........... 
===== started to processing the chunk ........... 
!!!! MAIN THREAD !!!!
====== Example 4 RESULT ========
{ name: 'Brian Flemming', occupation: 'teacher', born: '1967-11-22' }
====== Example 4 RESULT ========
{ name: 'John Doe', occupation: 'gardener', born: '1992-03-02' }
!!!! MAIN THREAD !!!!
==== chunk 3 transmitted =====================
!!!! MAIN THREAD !!!!
====== Example 4 RESULT ========
{ name: 'Lucy Black', occupation: 'accountant', born: '1995-04-07' }
!!!! MAIN THREAD !!!!
==== chunk 4 transmitted =====================
===== started to processing the chunk ........... 
====== Example 4 RESULT ========
{ name: 'William Bean', occupation: 'pilot', born: '1977-10-31' }

Now we see that the second element, "Brian", arrives before "John". If the processing time for the 1st and 3rd objects is increased to 3 sec, the last element, "William", also arrives before the third one, "Lucy".

So it is good practice to use pipe() rather than event listeners to consume data when the order of the data matters.

You might be wondering why the example code in the API doc uses its own chain() function to build the pipeline. It is the recommended design pattern for error handling in stream programming in Node. If an error is thrown downstream in the pipeline, it does not propagate to the upstream. So you have to attach a callback to every stream in the pipeline as follows (here we assume we have three streams a, b, c).

a.on('error', callbackForA)
 .pipe(b).on('error', callbackForB)
 .pipe(c).on('error', callbackForC)

This looks cumbersome compared to a Promise chain, where you can simply add .catch() at the end. And even though we set all the error handlers as above, it is still not enough.
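
For comparison, a Promise chain only needs a single handler at the end; stepA(), stepB(), stepC() and handleError() below are hypothetical, just to show the contrast.

stepA()
  .then(stepB)
  .then(stepC)
  .catch(handleError); // one handler covers an error thrown at any step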

When an error is thrown downstream, the stream that caused the error is detached from the pipeline with unpipe(); however, the upstream does not get destroyed automatically. This is because multiple streams might be connected to the upstream to branch out the stream line. So when you use pipe(), you have to close all the upstream streams yourself from each error handler.

To solve these problems, the community has provided pipeline-constructing libraries; I think chain() from stream-chain is one of them. Since Node v10, stream.pipeline has been available for this functionality. We can use this official pipeline constructor, since all the streams in stream-json are subclasses of the regular stream classes.

Before showing the usage of stream.pipeline, let's modify the MyObjectConsumerStream class to throw an error when the object with key 2 (the third element) is being processed.

Custom Stream that Throws Error

class MyErrorStream extends MyObjectConsumerStream {
  _write(chunk, encoding, callback) {
    const { key, value } = chunk; // receive from StreamArray of stream-json
    console.log("===== started to processing the chunk ........... ");
    if (key === 2)
      throw new Error("Error in key 2");
    setTimeout(() => {
      console.log("====== Example 5 RESULT ========");
      console.log(value); // do your data processing here
      callback(); // pipe() will pause the upstream until callback is called
    }, key % 2 === 0 ? 1500 : 0); // the 2nd and 4th objects take 0 sec, the 1st and 3rd take 1.5 sec
  }
}

stream.pipeline takes multiple streams in order, together with an error handler at the end. The error handler receives an instance of Error when an error is thrown, and receives no error when the pipeline finishes successfully.

Example 5: The Use of stream.pipeline

async function example5 () {
  const response = await fetch(chunkedJson);
  const myErrorHandler = (timerRef) => (error) => {
    if (error)
      console.log("Error in the pipiline", error.message);
    else
      console.log("Finished Example 5 successfully");
    clearInterval(timerRef); // stop the main thread console.log
  }
  const timer = setInterval(() => console.log("!!!! MAIN THREAD !!!!"), 500);
  stream.pipeline(
    response.body,
    StreamArray.withParser(),
    new MyErrorStream(),
    myErrorHandler(timer)
  );
  console.log("==== Example 5 Started ==============");
}

example5();

The Output of Example 5

==== Example 5 Started ==============
!!!! MAIN THREAD !!!!
!!!! MAIN THREAD !!!!
==== chunk 1 transmitted =====================
===== started to processing the chunk ........... 
!!!! MAIN THREAD !!!!
!!!! MAIN THREAD !!!!
==== chunk 2 transmitted =====================
!!!! MAIN THREAD !!!!
====== Example 5 RESULT ========
{ name: 'John Doe', occupation: 'gardener', born: '1992-03-02' }
===== started to processing the chunk ........... 
====== Example 5 RESULT ========
{ name: 'Brian Flemming', occupation: 'teacher', born: '1967-11-22' }
===== started to processing the chunk ........... 
/Users/shito/Documents/git-repositories/javascript/stackoverflow/JS/FailToParseJasonStream/ParseChunkedJsonAnswer.js:211
      throw new Error("Error in key 2");
      ^

Error: Error in key 2
    at MyErrorStream._write (/Users/shito/Documents/git-repositories/javascript/stackoverflow/JS/FailToParseJasonStream/ParseChunkedJsonAnswer.js:211:13)
    at doWrite (internal/streams/writable.js:377:12)
    at clearBuffer (internal/streams/writable.js:529:7)
    at onwrite (internal/streams/writable.js:430:7)
    at Timeout._onTimeout (/Users/shito/Documents/git-repositories/javascript/stackoverflow/JS/FailToParseJasonStream/ParseChunkedJsonAnswer.js:215:7)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7)

When an error is thrown, stream.pipeline() calls stream.destroy(error) on all streams that have not closed or finished properly, so we don't have to worry about memory leaks.
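
As a side note, stream.pipeline() can also be promisified with util.promisify(), which is exactly what the snippet in your update does. A promisified variant of the Example 3 pipeline might look roughly like the sketch below; any error the pipeline reports shows up as a rejected promise.

const util = require('util');
const pipelineAsync = util.promisify(stream.pipeline);

async function examplePromisified () {
  const response = await fetch(chunkedJson);
  try {
    // same streams as in Example 3, but awaited instead of using a callback
    await pipelineAsync(
      response.body,
      StreamArray.withParser(),
      new MyObjectConsumerStream()
    );
    console.log("Finished successfully");
  } catch (error) {
    console.log("Error in the pipeline:", error.message);
  }
}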

waterloos
  • Wow, thanks for trying out the different options here. But my purpose is to get rid of the third-party "StreamArray.withParser()". I don't want to use a dependency here; I'm trying to write my own implementation to save more time, since I don't need to parse the name/value fields in the JSON object (a rough dependency-free sketch follows below). – icelemon Aug 15 '21 at 18:40
  • Could you help me take another look at https://stackoverflow.com/questions/68767486/how-to-customize-stream-transform-to-parse-stream-array-json – icelemon Aug 15 '21 at 19:05
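
A rough, dependency-free sketch in the direction those comments describe: a hand-rolled stream.Transform that splits the top-level elements of a streamed JSON array by tracking brace depth and string state. It assumes the input is a JSON array of objects and skips many edge cases that a real parser such as stream-json handles.

const fs = require('fs');
const stream = require('stream');

// Collects the characters of each top-level object by tracking {} nesting
// depth and whether we are inside a string literal, then JSON.parses the
// completed object and pushes it downstream in object mode.
class JsonArrayElementStream extends stream.Transform {
  constructor (options) {
    super({ ...options, readableObjectMode: true });
    this.buffer = '';      // text of the object currently being collected
    this.depth = 0;        // current {} nesting depth
    this.inString = false; // true while inside a JSON string literal
    this.escaped = false;  // true right after a backslash inside a string
  }

  _transform (chunk, encoding, callback) {
    for (const ch of chunk.toString()) {
      if (this.depth > 0) this.buffer += ch;

      if (this.inString) {
        if (this.escaped) this.escaped = false;
        else if (ch === '\\') this.escaped = true;
        else if (ch === '"') this.inString = false;
        continue;
      }
      if (ch === '"') {
        this.inString = true;
      } else if (ch === '{') {
        if (++this.depth === 1) this.buffer = '{'; // start of a top-level object
      } else if (ch === '}') {
        if (--this.depth === 0) {
          try {
            this.push(JSON.parse(this.buffer)); // emit the completed object
          } catch (err) {
            return callback(err); // report malformed JSON as a stream error
          }
          this.buffer = '';
        }
      }
    }
    callback();
  }
}

// Usage with the read stream from the question:
fs.createReadStream('users.json', { highWaterMark: 120 })
  .pipe(new JsonArrayElementStream())
  .on('data', obj => console.log(obj));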