I'm a bit confused because all the examples I read about the Node cluster module only seem to apply to web servers handling concurrent requests, while for CPU-intensive work the worker_threads module is recommended instead.
But what about file I/O? Imagine I have an array with 1 million filenames: ['1.txt', '2.txt', etc., ..., '1000000.txt'], and for each file I need to do some heavy processing and then write the resulting content back out.
What would be the right way to use all the CPU cores efficiently, spreading the processing of different filenames across different cores?
Normally I would use this:
const fs = require('fs')
const async = require('async')
const heavyProcessing = require('./heavyProcessing.js')

const files = ['1.txt', '2.txt', ..., '1000000.txt']

async.each(files, function (file, cb) {
  fs.writeFile(file, heavyProcessing(file), function (err) {
    cb(err)
  })
}, function (err) {
  if (err) console.error(err)
})
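(In practice, with a million files, I would probably switch to async.eachLimit so that not all the writes start at once — the limit of 100 below is just an arbitrary number I picked:)

// same idea, but cap how many files are in flight at the same time
async.eachLimit(files, 100, function (file, cb) {
  fs.writeFile(file, heavyProcessing(file), cb)
}, function (err) {
  if (err) console.error(err)
})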
Should I now use the cluster module or worker_threads? And how should I use it?
Does this work?
const fs = require('fs')
const async = require('async')
const heavyProcessing = require('./heavyProcessing.js')
const cluster = require('node:cluster');
const numCPUs = require('node:os').cpus().length;
const process = require('node:process');
if (cluster.isPrimary) {
  console.log(`Primary ${process.pid} is running`);

  // Fork workers.
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died`);
  });
} else {
  // every forked worker runs this same block -- would they all process the full list?
  const files = ['1.txt', '2.txt', ..., '1000000.txt']

  async.each(files, function (file, cb) {
    fs.writeFile(file, heavyProcessing(file), function (err) {
      cb(err)
    })
  }, function (err) {
    if (err) console.error(err)
  })
}
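Or is something along these lines closer to how worker_threads is supposed to be used? This is only a rough sketch of what I imagine, assuming heavyProcessing is a synchronous function and that slicing the file list into one chunk per core is the right approach:

const fs = require('fs')
const os = require('node:os')
const { Worker, isMainThread, workerData } = require('node:worker_threads')
const heavyProcessing = require('./heavyProcessing.js')

if (isMainThread) {
  const files = ['1.txt', '2.txt', ..., '1000000.txt']
  const numCPUs = os.cpus().length
  const chunkSize = Math.ceil(files.length / numCPUs)

  for (let i = 0; i < numCPUs; i++) {
    // give each thread its own slice of the filename array
    const worker = new Worker(__filename, {
      workerData: files.slice(i * chunkSize, (i + 1) * chunkSize)
    })
    worker.on('exit', () => console.log(`thread ${i} finished its chunk`))
  }
} else {
  // inside a worker thread: only process the filenames we were given
  for (const file of workerData) {
    fs.writeFileSync(file, heavyProcessing(file))
  }
}

Is that the right pattern, or is there a better way to spread the files over the cores?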