
I am working on a report engine in Node.js that serves multiple applications from multiple databases. The problem is, I can't think of any way to merge two big data sets into one without waiting for both streams to finish.

For example, this is my first stream, let's say stream 1:

const mssql = require('mssql');

const request = new mssql.Request();
request.stream = true;
request.query('SELECT id, transaction, delivery FROM LargeTable');
request.on('row', row => {
    processStream(row);
});

and this is my second stream:

const mysql = require('mysql');

const connection = mysql.createConnection(...);
connection.query('SELECT id, transaction, supplydrop FROM AnotherLargeTable')
    .stream()
    .pipe(processStream);

Those are only two example tables. The real data sets will be determined by the user, and we can't assume they will use only MSSQL and MySQL; there could be other sources the user wants to merge, such as files, CSVs, or data consumed from the web. Each source runs at a different speed and bandwidth.

Now, if the user wants to aggregate and filter the data, I can't simply write something like:

SELECT id, transaction, delivery, supplydrop FROM LargeTable lt1
INNER JOIN AnotherLargeTable lt2 ON lt1.transaction = lt2.transaction
WHERE delivery = 'Customer1' AND supplydrop = 'bad supplier'

The problem is that I would have to wait until both streams end and then join the data, which would consume a large chunk of memory and could potentially crash the system. I could do some filtering on each data source as it streams, but the join can only be done after both streams are complete.

Is there any way to work with both data sets without exhausting too much memory? Is there some kind of small memory-based database engine that I can pipe both streams into and join them there? Is implementing an in-memory database solution overkill, or is there something else that fits this particular job nicely?
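
To make it concrete, this is roughly the kind of thing I'm imagining (a rough sketch, assuming the better-sqlite3 package with an in-memory database; the table and column names are just placeholders, and I haven't tried this):

// Rough sketch of the idea: pipe every source into an in-memory SQLite
// database, then let SQLite do the join. Assumes the better-sqlite3 package.
const Database = require('better-sqlite3');
const db = new Database(':memory:');

db.exec('CREATE TABLE t1 (id INT, trx INT, delivery TEXT)');
db.exec('CREATE TABLE t2 (id INT, trx INT, supplydrop TEXT)');

const insertT1 = db.prepare('INSERT INTO t1 VALUES (?, ?, ?)');

// called from the mssql 'row' handler (a t2 equivalent would feed the
// MySQL stream):
function processStream(row) {
    insertT1.run(row.id, row.transaction, row.delivery);
}

// once both streams have ended:
const rows = db.prepare(
    `SELECT t1.id, t1.trx, t1.delivery, t2.supplydrop
     FROM t1 INNER JOIN t2 ON t1.trx = t2.trx
     WHERE t1.delivery = ? AND t2.supplydrop = ?`
).all('Customer1', 'bad supplier');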

Thank you for your time.

Magician

1 Answer


I think the best first step here is to investigate which table's response is bigger.

Then, once you have found out, try to fit the smaller data source into memory (see the hash-join sketch after the list below) and think about a more compact way to represent the data. For example:

  1. Instead of returning JSON objects, consider working with CSV. This is an experiment I made with a large database response (the heap memory represents the JSON object loaded into memory):

[screenshot: heap memory usage of the JSON response loaded into memory]

  2. Or, if you want to stick with JSON objects, try to reduce the metadata. For example, let's take t1:

instead of a JSON object array like

[{id:1, transaction:2, delivery:3, supplydrop:4}, {id:2, transaction:2, delivery:3, supplydrop:4}, ...]

transpose it to

{id:[1,2], transaction:[2,2], delivery:[3,3], ...}

  3. You can try gRPC, but I found that it takes a lot of boilerplate effort to set it up in a new project from scratch.

  4. You can try MessagePack, a lightweight binary JSON representation: https://msgpack.org/index.html
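
As a quick illustration of item 4 (a minimal sketch, assuming the @msgpack/msgpack package):

// Encode a row with MessagePack and compare its size to the JSON text.
import { encode, decode } from '@msgpack/msgpack';

const row = { id: 1, transaction: 2, delivery: 3, supplydrop: 4 };
const packed = encode(row); // Uint8Array
console.log(packed.byteLength, JSON.stringify(row).length);
console.log(decode(packed)); // back to the original object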
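
And here is the hash-join sketch mentioned above: buffer the smaller source in a Map keyed by the join column, then stream the larger source against it row by row. This is only an illustration; smallStream, largeStream, and emitJoinedRow are hypothetical placeholders for your actual streams and output, and it assumes transaction is unique per row in the smaller table.

// Hash join: only the smaller table is held in memory; the larger
// table is filtered and joined as it streams.
const smallByTransaction = new Map();

smallStream.on('data', row => {
    // keep only the columns needed for the join and the output
    smallByTransaction.set(row.transaction, row.supplydrop);
});

smallStream.on('end', () => {
    largeStream.on('data', row => {
        const supplydrop = smallByTransaction.get(row.transaction);
        if (supplydrop !== undefined &&
            row.delivery === 'Customer1' &&
            supplydrop === 'bad supplier') {
            emitJoinedRow({ ...row, supplydrop });
        }
    });
});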

Before each step you take, try downloading the data directly from the data source as a file (for example, from MSSQL) and then check how much heap it will consume with this simple script:

import { memoryUsage } from 'process';
import fs from 'fs';

const FILE = "path to downloaded file";

// log the current heap usage in MB
function LOG_USED_HEAP() {
    console.log(`Memory usage: ${memoryUsage().heapUsed / 1024 / 1024} MB`);
}

// keep every chunk referenced so the whole payload stays on the heap
const dataPayloadInHeap = [];

// print heap usage once per second while the file loads
setInterval(() => {
    LOG_USED_HEAP();
}, 1000);

fs.createReadStream(FILE)
    .on('data', (chunk) => {
        dataPayloadInHeap.push(chunk);
    })
    .on('end', () => {
        console.log('end');
        LOG_USED_HEAP();
    });
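
(Note: the script uses ES module imports, so save it as an .mjs file or set "type": "module" in package.json before running it with node.)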

Lastly, if you have found the most compact way to store the data in memory and it still does not work, try giving Node.js more heap memory with the --max-old-space-size flag.
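
For example (report-engine.js here is just a placeholder for your entry script):

node --max-old-space-size=5012 report-engine.js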

If that does not work either and you still haven't managed to solve it at the application level, you need to think about a big-data solution: save both table responses to S3 and process them with a dedicated tool such as Spark or Flink on AWS.

Naor Tedgi