performance.now using onnxruntime for sequential and parallel execution modes

Question

I'm using Onnxruntime in NodeJS to execute onnx converted models in cpu backend. I run model inference in parallel using Promise.allSettled:

var promises = sequences.map(seq => self.inference(self.session, self.tokenizer, seq));
results = (await Promise.allSettled(promises)).filter(p => p.status === "fulfilled").map(p => p.value);

running this class instance method inference that calls a static method Util.performance.now

  ONNX.prototype.inference = async function (session, tokenizer, text) {
        const default_labels = this._options.model.default_labels;
        const labels = this._options.model.labels;
        const debug = this._options.debug;
        try {
            const encoded_ids = await tokenizer.tokenize(text);
            if (encoded_ids.length === 0) {
                return [0.0, default_labels];
            }
            const model_input = ONNX.create_model_input(encoded_ids);
            const start = Util.performance.now();
            const output = await session.run(model_input, ['output_0']);
            const duration = Util.performance.now(start).toFixed(1);

            const sequence_length = model_input['input_ids'].size;
            if (debug) console.log("latency = " + duration + "ms, sequence_length=" + sequence_length);
            const probs = output['output_0'].data.map(ONNX.sigmoid).map(t => Math.floor(t * 100));

            const result = [];
            for (var i = 0; i < labels.length; i++) {
                const t = [labels[i], probs[i]];
                result[i] = t;
            }
            result.sort(ONNX.sortResult);

            const result_list = [];
            for (i = 0; i < 6; i++) {
                result_list[i] = result[i];
            }
            return [parseFloat(duration), result_list];
        } catch (e) {
            return [0.0, default_labels];
        }
    }//inference

the timing is wrong and summed up. The performance object looks like

Util = {
   performance: {
      now: function (start) {
          if (!start) {
              return process.hrtime();
          }
          var end = process.hrtime(start);
          return Math.round((end[0] * 1000) + (end[1] / 1000000));
      }
  }
}

and it used in the usual way

// this runs parallel
const start = Util.performance.now();
// computation
const duration = (Util.performance.now() - start).toFixed(1);

Now, within the performance fun the start and end variables scope is local, so what happens using Promise.allSettled? I would expect that timing would be correct due to the local scope.

trincot · Accepted Answer · 2022-05-04T15:59:19.290

The timing mechanics are correct, but when session.run is called, it will initiate some asynchronous API (non-JavaScript) while already returning the pending promise object. This will allow other executions of inference to call session.run as well, leading to a state where the asynchronous API is dealing with several such requests concurrently, and so the same slices of time are counted in multiple execution contexts of inference. These requests may even end at moments in time that are quite close. When that happens, one after the other executions of inference resume with the code that ends the timing (setting their duration variables). It is clear that these durations can and probably will overlap each other.

To visualise it for 3 executions of inference, you could have this:

   start-----------------------------start+duration
      start-----------------------------start+duration
         start------------------------------start+duration

 time ----->

If you don't want this parallelism, you should not make all calls of inference at almost the same time, but wait with each next execution until the previous one is resolved:

for (let seq of sequences) {
     let value = await self.inference(self.session, self.tokenizer, seq));
     // ...
}

This way executions of the asynchronous part of session.run will not overlap, and will not suffer performance because of that concurrency. I would expect timings to be more like this:

   start---------------start+duration
                       start-------------start+duration
                                         start------------start+duration

 time ----->

Now timeslices will not be counted more than once, even though the total duration of the whole process will likely be longer.

Note that the behaviour has nothing to to with Promise.allSettled, because you'll get those outputs anyway, even if you remove that Promise.allSettled call from your program.

Thanks a lot! It's definitively correct, `session.run` will call the `onnxruntime` method `run` of their [ort module](https://www.npmjs.com/package/onnxruntime-node) that is this one https://github.com/microsoft/onnxruntime/blob/master/js/node/src/inference_session_wrap.h#L50, so it's native code in fact. — loretoparisi, May 04 '22 at 16:04
I also confirm that using `for` async loop the time overlap will work correctly. That was my default implementation, that for performance reasons I'm turning parallel in some cases. — loretoparisi, May 04 '22 at 16:06
I digged a bit into the execution provided by onnxruntime, and came out with another question, that is "slightly" connected to this (due to the `sequential` or `parallel` nature of the two approaches exposed in the answer - https://stackoverflow.com/questions/72117675/onnxruntime-nodejs-set-intraopnumthreads-and-interopnumthreads-by-execution-mode — loretoparisi, May 04 '22 at 18:17

performance.now using onnxruntime for sequential and parallel execution modes

1 Answers1

Linked