I have a Node.js pipeline with a readable stream that reads rows from a Postgres database via a Knex.js stream, a transform stream that transforms the rows, a second transform stream that batches them into chunks of 500 records, and finally a writable stream that updates the Postgres database. With small data sets there are no issues, but when the data is large, execution reaches the awaited UPDATE query and hangs. How can I resolve this?
I am implementing an ETL pipeline with streams that reads data from one table and updates data in another table. The pipeline stalls when the data set is large. Note: the read and the updates target different tables, but they both run on the same Knex transaction. A simplified version of the code:
import { Transform, Writable } from "stream";
import { pipeline } from "stream/promises";

// `knex` is an already-initialised Knex instance (config omitted)
const transaction = await knex.transaction();
const readStream = transaction.raw("SELECT * FROM Books").stream();
const transformStream = new Transform({
  objectMode: true,
  highWaterMark: 1000,
  transform(chunk, _encoding, next) {
    // actual row transformation omitted for brevity
    next(null, chunk);
  },
});
let batch: unknown[] = [];
const batchTransform = new Transform({
  objectMode: true,
  highWaterMark: 1,
  transform(chunk, _encoding, next) {
    batch.push(chunk);
    if (batch.length >= 500) {
      // emit the full batch and start a new one
      const fullBatch = batch;
      batch = [];
      next(null, fullBatch);
    } else {
      next();
    }
  },
  flush(next) {
    // emit whatever is left over when the read stream ends
    if (batch.length > 0) {
      this.push(batch);
    }
    next();
  },
});
const writableStream = new Writable({
  objectMode: true,
  highWaterMark: 1,
  async write(chunk, _encoding, callback) {
    // the real UPDATE is built from `chunk`; hard-coded here for brevity
    const queryText = `
      UPDATE "author"
      SET "Book" = 'test'
      WHERE "id" = '1';
    `;
    try {
      await transaction.raw(queryText); // execution hangs here when the data is large
      callback();
    } catch (err) {
      callback(err as Error);
    }
  },
});
await pipeline([readStream, transformStream, batchTransform, writableStream]);
await transaction.commit(); // never reached when the data set is large
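
My suspicion is that the streaming read and the UPDATEs are competing for the same connection, since both go through the one transaction. A minimal sketch of the variant I'm considering (assuming the writes don't strictly need to share the read's transaction, so they go through the pool via `knex.raw` instead of `transaction.raw`; the stream name and hard-coded query are just placeholders):

const writableStreamSeparateConnection = new Writable({
  objectMode: true,
  highWaterMark: 1,
  async write(chunk, _encoding, callback) {
    try {
      // runs on a pooled connection, not the transaction that is streaming the read
      await knex.raw(`
        UPDATE "author"
        SET "Book" = 'test'
        WHERE "id" = '1';
      `);
      callback();
    } catch (err) {
      callback(err as Error);
    }
  },
});

Is splitting the read and the writes across separate connections the right way to fix this, or is there a way to keep everything on one transaction without the pipeline stalling?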