I have a Node.js pipeline with a Readable stream that reads data from a Postgres database via a knex.js stream, a Transform stream that transforms the data, another Transform stream that batches the chunks into groups of 500 records, and finally a Writable stream that updates the Postgres database. For small data sets there are no issues, but when the data is large the stream execution reaches the awaited update query and hangs. How can I resolve this?

I have tried to implement an ETL pipeline using streams: it reads data from one table and updates data in another table. The stream pauses when the data is large. Note: the read and the update are performed on different tables, but they both use the same Knex transaction.

import { Transform, Writable } from "stream";
import { pipeline } from "stream/promises";

// `knex` is an initialized Knex instance connected to Postgres
const transaction = await knex.transaction();
const readStream = transaction.raw("SELECT * from Books").stream();
const transformStream = new Transform({
  objectMode: true,
  highWaterMark: 1000,
  transform(chunk, _encoding, next) {
    // pass the row through (placeholder transform); next() must be called
    // so the stream keeps pulling chunks
    next(null, chunk);
  },
});
let batch: unknown[] = [];
const batchTransform = new Transform({
  objectMode: true,
  highWaterMark: 1,
  transform(chunk, _encoding, next) {
    batch.push(chunk);
    if (batch.length >= 500) {
      // emit the full batch downstream and start collecting the next one
      const fullBatch = batch;
      batch = [];
      next(null, fullBatch);
    } else {
      // keep buffering and signal readiness for the next chunk
      next();
    }
  },
  flush(next) {
    next(null, batch);
  },
});
const writableStream = new Writable({
  objectMode: true,
  highWaterMark: 1,
  async write(chunk, _encoding, callback) {
    try {
      const queryText = `
        UPDATE "author"
        SET "Book" = 'test'
        WHERE "id" = '1';
      `;
      await transaction.raw(queryText); // with large data the pipeline hangs on this await
      callback();
    } catch (err) {
      callback(err as Error);
    }
  },
});

await pipeline([readStream, transformStream, batchTransform, writableStream]);
  • "*The read and update processes are performed on different tables, but they both use the same transaction*" - that would be your problem I guess. One transaction means one postgres connection means sequential statements. While your connection is occupied by reading from the `SELECT` statement output, it cannot already run the `UPDATE` statement(s). The `await` blocks until the connection is available and has executed the update query, but this is blocking the read stream from finishing the current query execution. – Bergi Aug 07 '23 at 10:44
  • You'll need to [use a cursor](https://www.postgresql.org/docs/current/sql-declare.html) instead so you can alternate between fetching and updating rows (see the sketch after these comments) – Bergi Aug 07 '23 at 10:52
  • …although [knex `.stream()` is said to use cursors](https://github.com/knex/knex/issues/656#issuecomment-1168070827) already, so maybe it's just your `highWaterMark` being too low or something so that it gets stuck? – Bergi Aug 07 '23 at 11:00
  • "One transaction means one postgres connection means sequential statements" - If that is the case, how come it works for smaller data volume. The problem arises when the volume exceeds the 32000 mark. Also, if I were to set a highWaterMark of 10,000, it would consume too much memory, which is not what I expect. Do you think replacing the knex stream with a async generator would help resolve the issue? – SB07 Aug 07 '23 at 12:10

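Putting Bergi's cursor suggestion and the async-generator idea from the comments together, a minimal sketch of the alternating read/update approach could look like the one below. This is only an illustration under assumptions not in the question: `readInBatches`, `run`, `books_cursor`, and the hard-coded `UPDATE` are hypothetical placeholders, and `knex` is assumed to be a Knex instance configured for Postgres. The key point is that each `FETCH` completes before the updates run, so the single transaction connection alternates between reading and writing instead of waiting on itself.

import { Knex } from "knex";

// Yield batches of rows from a server-side cursor opened on the transaction.
// Between FETCH calls the connection is idle, so it can also run updates.
async function* readInBatches(trx: Knex.Transaction, batchSize = 500) {
  await trx.raw('DECLARE books_cursor CURSOR FOR SELECT * FROM "Books"');
  try {
    while (true) {
      const { rows } = await trx.raw(`FETCH ${batchSize} FROM books_cursor`);
      if (rows.length === 0) break;
      yield rows;
    }
  } finally {
    await trx.raw("CLOSE books_cursor");
  }
}

async function run(knex: Knex) {
  const trx = await knex.transaction();
  try {
    for await (const batch of readInBatches(trx)) {
      // The SELECT is not streaming at this point, so the update on the
      // same transaction no longer competes with it for the connection.
      for (const row of batch) {
        await trx.raw('UPDATE "author" SET "Book" = ? WHERE "id" = ?', ["test", row.id]);
      }
    }
    await trx.commit();
  } catch (err) {
    await trx.rollback();
    throw err;
  }
}

If the stream-based pipeline should be kept, the same generator could also be passed to `pipeline` as the source, since Node's `pipeline` accepts an async iterable.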
0 Answers