Onnxruntime NodeJS set intraOpNumThreads and interOpNumThreads by execution mode

Question

I'm using ONNX Runtime in Node.js to run inference on ONNX-converted models with the CPU backend.

According to the docs, the optional parameters are the following:

     var options = {

            /**
             * 
             */
            executionProviders: ['cpu'],

            /*
             * The optimization level.
             * 'disabled'|'basic'|'extended'|'all'
            */
            graphOptimizationLevel: 'all',

            /**
             * The intra OP threads number.
             * change the number of threads used in the threadpool for Intra Operator Execution for CPU operators 
             */
            intraOpNumThreads: 1,

            /**
             * The inter OP threads number.
             * Controls the number of threads used to parallelize the execution of the graph (across nodes).
             */
            interOpNumThreads: 1,

            /**
             * Whether enable CPU memory arena.
             */
            enableCpuMemArena: false,

            /**
             * Whether enable memory pattern.
             *
             */
            enableMemPattern: false,

            /**
             * Execution mode.
             * 'sequential'|'parallel'
             */
            executionMode: 'sequential',

            /**
             * Log severity level
             * @see ONNX.Severity
             * 0|1|2|3|4
             */
            logSeverityLevel: ONNX.Severity.kERROR,

            /**
             * Log verbosity level.
             *
             */
            logVerbosityLevel: ONNX.Severity.kERROR,

        };

Specifically, I can control (as in TensorFlow) the threading parameters intraOpNumThreads and interOpNumThreads, which are defined as above.

I want to optimize both of them for the sequential and parallel execution modes (controlled by the executionMode parameter defined above). My first approach was:

var numCPUs = require('os').cpus().length;
options.intraOpNumThreads = numCPUs;

so that there are at least as many threads as available CPUs. On my MacBook Pro this gives the following session configuration for sequential execution mode:

{
  executionProviders: [ 'cpu' ],
  graphOptimizationLevel: 'all',
  intraOpNumThreads: 8,
  interOpNumThreads: 1,
  enableCpuMemArena: false,
  enableMemPattern: false,
  executionMode: 'sequential',
  logSeverityLevel: 3,
  logVerbosityLevel: 3
}

and for parallel execution mode I set both:

{
  executionProviders: [ 'cpu' ],
  graphOptimizationLevel: 'all',
  intraOpNumThreads: 8,
  interOpNumThreads: 8,
  enableCpuMemArena: false,
  enableMemPattern: false,
  executionMode: 'parallel',
  logSeverityLevel: 3,
  logVerbosityLevel: 3
}

Another approach could be to use a percentage of the available CPUs:

var perc = (val, tot) => Math.round( tot*val/100 );
var numCPUs = require('os').cpus().length;
if (options.executionMode === 'parallel') { // parallel
   options.interOpNumThreads = perc(50,numCPUs);
   options.intraOpNumThreads = perc(10,numCPUs);
} else { // sequential
   options.interOpNumThreads = perc(100,numCPUs);
   options.intraOpNumThreads = 1;
}

However, I can't find any documentation confirming that this is the optimal configuration for the two scenarios ('sequential' and 'parallel' execution modes). Is this approach theoretically correct?
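A side note on the percentage helper above: Math.round can return 0 on machines with few cores (10% of 4 CPUs rounds to 0). As far as I know, ONNX Runtime reads a thread count of 0 as "pick a default", which is probably not what the percentage was meant to express. The sketch below keeps the same percentages but clamps the result to at least 1. The names percAtLeast and withThreadCounts are hypothetical helpers made up for illustration, not part of any library.

```javascript
const os = require('os');

// Percentage of `total`, rounded, but never below `min`.
// Plain Math.round can yield 0 on small machines (e.g. 10% of 4 CPUs).
function percAtLeast(percent, total, min = 1) {
  return Math.max(min, Math.round((total * percent) / 100));
}

// Hypothetical helper: returns a copy of `options` with the thread fields
// filled in, using the same percentages as the snippet in the question.
function withThreadCounts(options, numCPUs = os.cpus().length) {
  const parallel = options.executionMode === 'parallel';
  return {
    ...options,
    interOpNumThreads: percAtLeast(parallel ? 50 : 100, numCPUs),
    intraOpNumThreads: parallel ? percAtLeast(10, numCPUs) : 1,
  };
}

// Example on a 4-core machine: 10% of 4 rounds to 0 and is clamped to 1.
console.log(withThreadCounts({ executionMode: 'parallel' }, 4));
```

This only guards against the rounding edge case; whether these percentages are a good choice for a given model is a separate question.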

score 1 · Answer 1 · answered May 09 '22 at 19:47


It really depends on the model structure. I usually use sequential execution mode because most models are sequential: in a CNN, for example, each layer depends on the previous one, so the layers have to be executed one by one.

My advice is to test different configurations and pick one based on the measured performance numbers.
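A minimal sketch of such a comparison, assuming each candidate configuration can be turned into an async function that performs one inference. The names timeRuns, compareConfigs and makeRunner are hypothetical helpers invented for this example; only the onnxruntime-node calls shown in the trailing comment (InferenceSession.create and session.run) are real API, and the model path and feeds there are placeholders.

```javascript
const { performance } = require('perf_hooks');

// Times `iterations` calls of an async `run` function after `warmup` untimed calls.
async function timeRuns(run, { warmup = 3, iterations = 20 } = {}) {
  for (let i = 0; i < warmup; i++) await run();
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await run();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return {
    // Middle sample (the upper one when the sample count is even).
    medianMs: samples[Math.floor(samples.length / 2)],
    minMs: samples[0],
    maxMs: samples[samples.length - 1],
  };
}

// `makeRunner(config)` is a hypothetical factory: it returns (a promise of)
// an async function that performs one inference with that config.
async function compareConfigs(configs, makeRunner, timing) {
  const results = [];
  for (const config of configs) {
    const run = await makeRunner(config);
    results.push({ config, ...(await timeRuns(run, timing)) });
  }
  // Fastest configuration (by median latency) first.
  return results.sort((a, b) => a.medianMs - b.medianMs);
}

// With onnxruntime-node, makeRunner could look roughly like this
// ('model.onnx' and `feeds` are placeholders):
//
//   const ort = require('onnxruntime-node');
//   const makeRunner = async (config) => {
//     const session = await ort.InferenceSession.create('model.onnx', config);
//     return () => session.run(feeds);
//   };
```

Running the configurations one after another, with a few warm-up calls each, keeps them from competing for the same cores while being measured. The numbers are only meaningful for the machine and model they were measured on.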

Another consideration is how you expect your application to behave: consume all CPUs for the best performance (lowest inference latency), or strike a balance between performance and power consumption. The choice is entirely up to you.

answered May 09 '22 at 19:47

eire


Thanks. Two good points. My model architecture is a transformer (BERT), and I really don't know which ops in the generated ONNX graph can run in parallel; this needs to be investigated. In terms of performance, it's true that a CPU-bound solution could cause scalability issues and bottlenecks in the inference system. Assuming power consumption is not a concern here, the balance will be between low latency and CPU load. – loretoparisi May 09 '22 at 20:58
