
Node.js server with a MongoDB - one feature will generate a JSON report file from the DB, which can take a while (60 seconds and up - it has to process hundreds of thousands of entries).

We want to run this as a background task. We need to be able to start a report build process, monitor it, and abort it if the user decides to change the params and rebuild it.

What is the simplest approach with Node? I don't really want to get into the realms of separate worker servers processing jobs, message queues, etc. - we need to keep this on the same box with a fairly simple implementation.

1) Start the build as an async method and return to the user, with socket.io reporting progress?

2) Spin off a child process for the build script?

3) Use something like https://www.npmjs.com/package/webworker-threads?

With the few approaches I've looked at, I get stuck on the same two areas:

1) How to monitor progress?

2) How to abort an existing build process if the user re-submits data?

Any pointers would be greatly appreciated...

Matt Bryson

1 Answer


The best approach would be to separate this task from your main application. That said, it's easy enough to run it in the background. To run and monitor it in the background without a message queue etc., the easiest option is a child_process.

  1. Launch a spawned job from an endpoint (or URL) called by the user.
  2. Next, set up a socket to return live monitoring of the child process.
  3. Add another endpoint to stop the job, using a unique ID returned by step 1 (or not, depending on your concurrency needs).

Some coding ideas:

var express = require('express')
var spawn = require('child_process').spawn

var app = express()

var job = null // keep the job in memory so we can kill it later

app.get('/save', function(req, res) {

    if (job && job.pid)
        return res.status(409).send('Job is already running') // 409 Conflict

    job = spawn('node', ['/path/to/save/job.js'],
    {
        detached: false, // if not detached and your main process dies, the child will be killed too
        stdio: [process.stdin, process.stdout, process.stderr] // these can be file streams for logs or whatever you need
    })

    job.on('close', function(code) {
        job = null
        // send socket information about the job ending
    })

    return res.status(201).end() // created
})

app.get('/stop', function(req, res) {
    if(!job || !job.pid)
        return res.status(404).end()

    job.kill('SIGTERM')
    //or process.kill(job.pid, 'SIGTERM')
    job = null
    return res.status(200).end()
})
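
If the build script needs to clean up before dying (e.g. removing a half-written report file), the child can trap SIGTERM itself. A minimal sketch; what goes in the handler is up to your job script:

// inside job.js
process.on('SIGTERM', function() {
    // e.g. delete the partially written report file here
    process.exit(0)
})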

app.get('/isAlive', function(req, res) {
    try {
        process.kill(job.pid, 0) // signal 0 doesn't kill; it only tests that the process exists
        return res.status(200).end()
    } catch(e) { return res.status(500).send(e.message) }
})
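
On the progress-monitoring question: since the job is itself a Node script, one option is an IPC channel relayed over socket.io. A minimal sketch, assuming you launch with child_process.fork instead of spawn (fork wires up IPC automatically) and that `io` is your socket.io server instance; the message shape is illustrative:

var fork = require('child_process').fork

// parent: fork gives us a built-in IPC channel to the child
job = fork('/path/to/save/job.js')

job.on('message', function(msg) {
    io.emit('report-progress', msg) // relay e.g. { done: 12000, total: 300000 } to clients
})

// child (job.js): report progress as entries are processed
// process.send({ done: processed, total: totalEntries })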

To monitor the child process's resource usage you could use pidusage; we use it in PM2, for example. Add a route to monitor a job and call it every second. Don't forget to release the reference when the job ends.
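
A sketch of such a route, assuming pidusage is installed; recent versions of the module are callable directly (older versions exposed pidusage.stat(pid, cb) instead), and the /usage path is just an example:

var pidusage = require('pidusage')

app.get('/usage', function(req, res) {
    if (!job || !job.pid)
        return res.status(404).end()

    // reports CPU percentage and memory (bytes) for the child's PID
    pidusage(job.pid, function(err, stats) {
        if (err)
            return res.status(500).send(err.message)
        res.json({ cpu: stats.cpu, memory: stats.memory })
    })
})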


You might want to check out this library, which will help you manage multiprocessing across microservices.

soyuka
  • Thanks for the answer @soyuka. With the `job` var - that holds on to a reference to the child so you can stop it - but does it work on PID? PIDs get reused, don't they? So the job we spawn might complete and its PID is freed up for any other new process to take? Which means job.kill() could potentially kill a different process if it works off PID alone? Or does it not work like that... – Matt Bryson Apr 27 '15 at 14:41
  • Just checked the docs and it says... "May emit an 'error' event when the signal cannot be delivered. Sending a signal to a child process that has already exited is not an error but may have unforeseen consequences: if the PID (the process ID) has been reassigned to another process, the signal will be delivered to that process instead. What happens next is anyone's guess." But nulling the reference on completion should solve that as per your example!! Sorry, missed that. – Matt Bryson Apr 27 '15 at 14:46
  • This code seems to allow only one reporting job at a time and will overwrite (and lose track of the previous job) if an attempt is made to start a 2nd one. – jfriend00 Apr 27 '15 at 14:51
  • @jfriend00 indeed, that's why I spoke about the concurrency needs (not stated in the question). If you want more jobs, just keep a pid-cache array holding the child processes. @MattBryson indeed, you'll have to be sure that the memory reference is removed when the job ends (`exit` or `close` events). `close` relates to the stdio streams closing, and `exit` fires when the process exits. – soyuka Apr 27 '15 at 14:58
  • @jfriend00 added a condition if jobs is already running. Please keep in mind that it's a small draft, I wanted to give some hints about how I would do this but I'm not doing a full working example ;). – soyuka Apr 27 '15 at 15:04
  • Yeah - I appreciate the effort; the actual implementation will be 1 job per report, so I'll need some kind of dictionary, keyed by report IDs, each holding a job instance. Then I'll clear the reference on error / completion. – Matt Bryson Apr 28 '15 at 09:45
  • Regarding performance, this will spin up a whole new Node instance for each job, will it? But only loading the dependencies that I require in my script? – Matt Bryson Apr 28 '15 at 09:46
  • Yes. It'll spawn a child process with your export job, for example. As said before, this is only a draft; if you want to hold job instances, keep a dictionary of `job.pid`'s, making sure that on error/completion they are removed from it. – soyuka Apr 28 '15 at 09:57
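
Following up on the comments, a minimal sketch of the one-job-per-report dictionary idea; the `jobs` map, route paths and report ID parameter are illustrative:

var jobs = {} // reportId -> ChildProcess, one job per report

app.get('/report/:id/build', function(req, res) {
    var id = req.params.id
    if (jobs[id])
        return res.status(409).send('A build is already running for this report')

    jobs[id] = spawn('node', ['/path/to/save/job.js', id])
    jobs[id].on('close', function() {
        delete jobs[id] // clear the reference on completion or error
    })
    res.status(201).end()
})

app.get('/report/:id/stop', function(req, res) {
    var id = req.params.id
    if (!jobs[id])
        return res.status(404).end()
    jobs[id].kill('SIGTERM')
    delete jobs[id]
    res.status(200).end()
})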