
I compared MXNet performance between Mathematica and Python and observed more than an order of magnitude difference. I would like advice on how to improve performance under Python.

My NN is an MLP for regression, with 3 float inputs, fully connected hidden layers of 8, 16, 24, and 8 neurons, and 2 float outputs. Sigmoid activation is used everywhere except on the input and output neurons. The optimizer used in Mathematica is Adam, so I used it in Python too, with the same parameters. The training dataset contains 4215 records mapping xyY colors to Munsell Hue and Chroma.

Mathematica is version 11.2, released in 2017, and it uses MXNet under the hood for deep learning tasks. On the Python side, I use the latest release of mxnet-mkl, and I checked that MKL-DNN is enabled.

Mathematica runs on an MS Surface Pro notebook with Windows 10 and an i7-7660U (2.5 GHz, 2 cores, 4 hyperthreads, AVX2). I ran Python on the same computer for comparison.

Here are the times for training loops of 32768 epochs at various batch sizes:

Batch Sizes:   128,   256,   512,  1024, 2048,  4096
Mathematica: 8m12s, 5m14s, 3m34s, 2m57s, 3m4s, 3m48s
PythonMxNet:  286m,  163m,   93m,   65m,  49m,   47m

I tried the MXNet environment-variable optimization tricks suggested by Intel, but they only made the times about 120% slower.
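
For reference, those environment variables have to be set before mxnet is imported. A minimal sketch, where the specific values are the commonly suggested starting points for this kind of CPU, not settings verified for this machine:

```python
import os

# These must be set before "import mxnet"; the values below are the usual
# Intel starting points (assumptions, not tuned for this Surface Pro).
os.environ[ 'OMP_NUM_THREADS' ] = '2'                          # physical cores, not hyperthreads
os.environ[ 'KMP_AFFINITY' ] = 'granularity=fine,compact,1,0'  # pin OpenMP threads to cores
os.environ[ 'MXNET_SUBGRAPH_BACKEND' ] = 'MKLDNN'              # enable the MKL-DNN subgraph backend

# import mxnet as mx   # import only after the environment is configured
```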

I also switched all the arrays from float64 to float32, on the hypothesis that MKL could process twice as many operations in the same amount of time with SIMD registers (excluding overhead, of course), but I saw no improvement at all.
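
The lane-count arithmetic behind that hypothesis can be sketched with the standard library alone (the MXNet arrays themselves would be cast with astype('float32'); the 256-bit register width below assumes AVX2):

```python
from array import array

# An AVX2 register is 256 bits wide: it holds 8 float32 lanes but only
# 4 float64 lanes, hence the "twice the throughput" hypothesis.
f32 = array( 'f', [ 0.0 ] )   # C float, 4 bytes
f64 = array( 'd', [ 0.0 ] )   # C double, 8 bytes
lanes32 = 256 // ( f32.itemsize * 8 )
lanes64 = 256 // ( f64.itemsize * 8 )
print( lanes32, lanes64 )   # → 8 4
```

If the wall-clock time is dominated by per-batch framework overhead rather than arithmetic, doubling the SIMD lane count would indeed show no measurable effect.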

The reason I switched my NN work from Mathematica to Python is that I want to train the NN on different, more powerful computers. I also don't like having my notebook tied up with NN training tasks.

How should I interpret those results?

What may be the cause of those performance differences?

Is there anything I can do to gain some performance under Python?

Or is this simply the unavoidable overhead imposed by the Python interpreter?
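
One way to separate fixed per-batch Python overhead from kernel time is to time many invocations directly. A minimal standard-library sketch, where batch_fn is just a stand-in for a real forward/backward pass:

```python
import time

def seconds_per_batch( batch_fn, n_batches ):
   # Time n_batches invocations and return seconds per batch; a fixed
   # per-call overhead shows up as a cost that shrinks with larger batches.
   start = time.perf_counter()
   for _ in range( n_batches ):
      batch_fn()
   return ( time.perf_counter() - start ) / n_batches

per_batch = seconds_per_batch( lambda: sum( range( 1000 ) ), 100 )
```

Comparing this figure across batch sizes indicates how much of the total time is per-call overhead rather than computation.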

Edit:

The script for generating the NN:

def get_MunsellNet( Layers, NbInputs ):
   # Build an MLP with sigmoid hidden layers and a linear 2-output layer
   net = nn.HybridSequential()
   for l, width in enumerate( Layers ):
      if l == 0:
         # The first layer declares in_units so all shapes are known up front
         net.add( nn.Dense( width, activation = 'sigmoid', dtype = mu.DType, in_units = NbInputs ) )
      else:
         net.add( nn.Dense( width, activation = 'sigmoid', dtype = mu.DType ) )
   net.add( nn.Dense( 2, dtype = mu.DType ) )   # linear output: Hue and Chroma
   net.hybridize()
   net.initialize( mx.init.Uniform(), ctx = ctx )
   return net

The NN is created with this:

mu.DType = 'f8'
NbInputs = 3
Norm = 'None'   # Possible normalizers are: 'None', 'Unit', 'RMS', 'RRMS', 'Std'
Train_Dataset = mnr.Build_HCTrainData( NbInputs, Norm, Datasets = [ 'all.dat', 'fill.dat' ] )
Test_Dataset1 = mnr.Build_HCTestData( 'real.dat', NbInputs )
Test_Dataset2 = mnr.Build_HCTestData( 'test.dat', NbInputs )
Layers = [ 8, 16, 24, 8 ]
Net = mnn.get_MunsellNet( Layers, NbInputs )
Loss_Fn = mx.gluon.loss.L2Loss()
Learning_Rate = 0.0005
Optimizer = 'Adam'
Batch_Size = 4096
Epochs = 500000

And trained with this:

if __name__ == '__main__':
   global Train_Data_Loader
   Train_Data_Loader = mx.gluon.data.DataLoader( Train_Dataset, batch_size = Batch_Size, shuffle = True, num_workers = mnn.NbWorkers )
   Trainer = mx.gluon.Trainer( Net.collect_params(), Optimizer, { 'learning_rate': Learning_Rate } )
   Estimator = estimator.Estimator( net = Net,
                                    loss = Loss_Fn,
                                    trainer = Trainer,
                                    context = mnn.ctx )
   LossRecordHandler = mnu.ProgressRecorder( Epochs, Test_Dataset1, Test_Dataset2, NbInputs, Net_Name, Epochs / 10 * 8 )
   for n in range( 10 ):
      LossRecordHandler.ResetStates()
      Train_Metric = Estimator.prepare_loss_and_metrics()
      Net.initialize( force_reinit = True )
      # Suppress the deprecation warnings emitted by Estimator.fit
      with warnings.catch_warnings():
         warnings.simplefilter( "ignore" )
         Estimator.fit( train_data = Train_Data_Loader,
                        epochs = Epochs,
                        event_handlers = [ LossRecordHandler ] )

Let me know if you need more code clips.

Yves Poissant
  • We will first need to ensure that this comparison is apples-to-apples, but while I know mxnet well, unfortunately I don't know enough about Mathematica to offer suggestions on what to watch out for. Could you explain how you make sure the comparison setting is consistent (e.g. are they using the same mxnet build)? Also, would you mind sharing the python script that shows the model size? – szha Aug 31 '20 at 04:14
  • I'm not sure what you mean by the python script that shows the model size. I have a function that creates the network and a call to that function with parameters. Is that what you want? Also, should I post this here in the comments or down in "answer your own question"? – Yves Poissant Sep 01 '20 at 16:11
  • Concerning details about Mathematica's use of MxNet, I'm afraid this information is not available. But I would guess it uses a pretty old MxNet, because Mathematica v11.2 was released in 2017 and I never updated it. In any case, certainly a much older MxNet version than the one I used under Python. – Yves Poissant Sep 01 '20 at 16:24
  • 1
    I just added some code clips into my question above. – Yves Poissant Sep 01 '20 at 16:35
  • Thanks. It looks like I'm missing some modules in order to run this. Would you be able to check in the code somewhere so that I could run? – szha Sep 05 '20 at 19:35
  • OK. I will see if I can post the whole application with test data on GitHub. I'll let you know. – Yves Poissant Sep 08 '20 at 02:38

1 Answer


In general, Python overhead can be large, especially when the compute kernels are launched frequently with small inputs. The benchmark above gives some evidence that a large chunk of the gap is due to this overhead: as the batch size increases from 128 to 4096 (32x), the mxnet-to-mathematica time ratio decreases from roughly 35 to roughly 12, which can be read as a speed-up from fewer kernel invocations. I will update the answer once there are more details on this question.
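
The ratio estimate can be reproduced from the table in the question, with all timings converted to seconds (a sketch using only the posted numbers):

```python
# Timings from the question, per 32768-epoch run, converted to seconds.
batch_sizes  = [ 128, 256, 512, 1024, 2048, 4096 ]
mathematica  = [ 8*60+12, 5*60+14, 3*60+34, 2*60+57, 3*60+4, 3*60+48 ]
python_mxnet = [ m * 60 for m in [ 286, 163, 93, 65, 49, 47 ] ]

# Python-to-Mathematica slowdown per batch size; it shrinks monotonically
# as the batch size grows, consistent with a fixed per-invocation overhead.
ratios = [ round( p / m, 1 ) for p, m in zip( python_mxnet, mathematica ) ]
print( ratios )
```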

szha