I compared mxnet performance between Mathematica and Python and observed a performance difference of more than an order of magnitude. I would like advice on how to improve performance under Python.
My NN is an MLP for regression, with 3 float inputs, fully connected layers of 8, 16, 24, and 8 neurons, and 2 float outputs. Sigmoid activation is used everywhere except on the input and output neurons. The optimizer used in Mathematica is Adam, so I used Adam in Python too, with the same parameters. The training dataset contains 4215 records mapping xyY colors to Munsell Hue and Chroma.
Mathematica is version 11.2 (released in 2017) and uses mxnet under the hood for deep learning tasks. On the Python side, I use the latest mxnet-mkl release, and I checked that MKLDNN is enabled.
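For reference, this is the kind of check I mean (a minimal sketch; it assumes a recent build, since mx.runtime.Features is only available in MXNet 1.5 and later):

    # Sketch: confirm the installed mxnet build has MKL-DNN compiled in
    # (assumes MXNet >= 1.5, where mx.runtime.Features exists).
    import mxnet as mx

    features = mx.runtime.Features()
    print( mx.__version__ )
    print( features.is_enabled( 'MKLDNN' ) )   # True for mxnet-mkl builds with MKL-DNN support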
My Mathematica license runs on an MS Surface Pro notebook with Windows 10 (i7-7660U, 2.5 GHz, 2 cores, 4 hyperthreads, AVX2). I ran Python on the same computer for the comparison.
Here are the times for training loops of 32768 epochs at various batch sizes:

Batch size:     128     256     512     1024    2048    4096
Mathematica:    8m12s   5m14s   3m34s   2m57s   3m4s    3m48s
PythonMxNet:    286m    163m    93m     65m     49m     47m
I tried the mxnet environment-variable optimization tricks suggested by Intel, but only got times that were about 120% slower.
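For completeness, these are the kind of settings I mean (a sketch; the values here are illustrative for a 2-core CPU and may differ slightly from the exact ones Intel recommends):

    # Environment variables set before importing mxnet (representative of the
    # Intel/MKL-DNN tuning advice; illustrative values only).
    import os

    os.environ[ 'OMP_NUM_THREADS' ] = '2'                           # number of physical cores
    os.environ[ 'KMP_AFFINITY' ] = 'granularity=fine,compact,1,0'   # pin OpenMP threads
    os.environ[ 'MXNET_SUBGRAPH_BACKEND' ] = 'MKLDNN'               # enable MKL-DNN subgraph fusion

    import mxnet as mx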
I also switched all the arrays from float64 to float32, on the hypothesis that MKL could process twice as many operations in the same amount of time with its SIMD registers (excluding overhead, of course), but I did not notice even a slight improvement.
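In my code below the dtype comes from mu.DType, so the switch amounted to building with 'f4' instead of 'f8' and casting the data arrays. A generic Gluon sketch of the same idea, with stand-in names:

    # Gluon sketch of the float64 -> float32 switch (stand-in net and data).
    import mxnet as mx
    from mxnet.gluon import nn

    net = nn.Dense( 2, dtype = 'float64' )                 # stand-in for the MLP defined below
    net.initialize()
    net.cast( 'float32' )                                  # cast every parameter of the block

    data = mx.nd.random.uniform( shape = ( 4215, 3 ), dtype = 'float64' )
    data = data.astype( 'float32' )                        # cast the arrays to match the net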
The reason I switched my NN work from Mathematica to Python is that I want to train the NN on different and more powerful computers. I also don't like having my notebook tied up with NN training tasks.
How should I interpret those results?
What may be the cause of those performance differences?
Is there anything I can do to gain some performance under Python?
Or is this simply the unavoidable overhead imposed by the Python interpreter?
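To help pin down where the time goes, here is the kind of isolated timing I have in mind (a sketch with synthetic data; it mirrors the architecture above but bypasses my DataLoader and the Estimator):

    # Diagnostic sketch: time raw forward/backward on one in-memory batch, so the
    # MXNet compute cost can be compared against the full training pipeline.
    import time
    import mxnet as mx
    from mxnet.gluon import nn

    net = nn.HybridSequential()
    for units in [ 8, 16, 24, 8 ]:
        net.add( nn.Dense( units, activation = 'sigmoid' ) )
    net.add( nn.Dense( 2 ) )
    net.hybridize()
    net.initialize( mx.init.Uniform() )

    x = mx.nd.random.uniform( shape = ( 4096, 3 ) )   # one synthetic batch
    y = mx.nd.random.uniform( shape = ( 4096, 2 ) )

    loss_fn = mx.gluon.loss.L2Loss()
    trainer = mx.gluon.Trainer( net.collect_params(), 'adam', { 'learning_rate': 0.0005 } )

    start = time.time()
    for _ in range( 1000 ):
        with mx.autograd.record():
            loss = loss_fn( net( x ), y )
        loss.backward()
        trainer.step( x.shape[ 0 ] )
    mx.nd.waitall()                                   # flush MXNet's async engine before stopping the clock
    print( 'seconds per step:', ( time.time() - start ) / 1000 )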
Edit:
The script for generating the NN:
def get_MunsellNet( Layers, NbInputs ):
    net = nn.HybridSequential()
    for l in range( len( Layers ) ):
        if l == 0:
            net.add( nn.Dense( Layers[ l ], activation = 'sigmoid', dtype = mu.DType, in_units = NbInputs ) )
        else:
            net.add( nn.Dense( Layers[ l ], activation = 'sigmoid', dtype = mu.DType ) )
    net.add( nn.Dense( 2, dtype = mu.DType ) )
    net.hybridize()
    net.initialize( mx.init.Uniform(), ctx = ctx )
    return net
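A quick sanity check of that factory (a sketch; ctx and mu.DType come from my own modules, as in the function above):

    # Sanity check: one forward pass through a freshly built net; expects output shape (4, 2).
    net = get_MunsellNet( [ 8, 16, 24, 8 ], 3 )
    out = net( mx.nd.ones( ( 4, 3 ), ctx = ctx, dtype = mu.DType ) )
    print( out.shape )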
The NN is created with this:
mu.DType = 'f8'
NbInputs = 3
Norm = 'None' # Possible normalizers are: 'None', 'Unit', 'RMS', 'RRMS', 'Std'
Train_Dataset = mnr.Build_HCTrainData( NbInputs, Norm, Datasets = [ 'all.dat', 'fill.dat' ] )
Test_Dataset1 = mnr.Build_HCTestData( 'real.dat', NbInputs )
Test_Dataset2 = mnr.Build_HCTestData( 'test.dat', NbInputs )
Layers = [ 8, 16, 24, 8 ]
Net = mnn.get_MunsellNet( Layers, NbInputs )
Loss_Fn = mx.gluon.loss.L2Loss()
Learning_Rate = 0.0005
Optimizer = 'Adam'
Batch_Size = 4096
Epochs = 500000
And trained with this:
if __name__ == '__main__':
    global Train_Data_Loader
    Train_Data_Loader = mx.gluon.data.DataLoader( Train_Dataset, batch_size = Batch_Size, shuffle = True, num_workers = mnn.NbWorkers )
    Trainer = mx.gluon.Trainer( Net.collect_params(), Optimizer, {'learning_rate': Learning_Rate} )
    Estimator = estimator.Estimator( net = Net,
                                     loss = Loss_Fn,
                                     trainer = Trainer,
                                     context = mnn.ctx )
    LossRecordHandler = mnu.ProgressRecorder( Epochs, Test_Dataset1, Test_Dataset2, NbInputs, Net_Name, Epochs / 10 * 8 )
    for n in range( 10 ):
        LossRecordHandler.ResetStates()
        Train_Metric = Estimator.prepare_loss_and_metrics()
        Net.initialize( force_reinit = True )
        # ignore warnings for nightly test on CI only
        with warnings.catch_warnings():
            warnings.simplefilter( "ignore" )
            Estimator.fit( train_data = Train_Data_Loader,
                           epochs = Epochs,
                           event_handlers = [ LossRecordHandler ] )
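If it helps, a wall-clock wrapper along these lines (a sketch reusing the objects defined above) is how I would measure a single run for the batch-size comparison:

    # Sketch: wall-clock timing of one fit() call, reusing Estimator, Train_Data_Loader
    # and LossRecordHandler from above.
    import time

    start = time.time()
    with warnings.catch_warnings():
        warnings.simplefilter( "ignore" )
        Estimator.fit( train_data = Train_Data_Loader,
                       epochs = Epochs,
                       event_handlers = [ LossRecordHandler ] )
    elapsed = time.time() - start
    print( 'total: %.1f s, per epoch: %.4f s' % ( elapsed, elapsed / Epochs ) )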
Let me know if you need more code snippets.