I have a model trained with multiple LayerNormalization layers, and I am unsure if a simple weight transfer works properly when activating dropout for prediction. This is the code I am using:

from tensorflow.keras.models import load_model, Model
from tensorflow.keras.layers import Dense, Dropout, LayerNormalization, Input

# Load the trained (deterministic) model and grab its weights
model0 = load_model(path + 'model0.h5')
OW = model0.get_weights()

# Rebuild the same architecture, but with dropout active at inference time
inp = Input(shape=(10,))
D1 = Dense(760, activation='softplus')(inp)
DO1 = Dropout(0.29)(D1, training=True)
N1 = LayerNormalization()(DO1)
D2 = Dense(460, activation='softsign')(N1)
DO2 = Dropout(0.16)(D2, training=True)
N2 = LayerNormalization()(DO2)
D3 = Dense(664, activation='softsign')(N2)
DO3 = Dropout(0.09)(D3, training=True)
N3 = LayerNormalization()(DO3)
out = Dense(1, activation='linear')(N3)

# Transfer the trained weights into the new model
mP = Model(inp, out)
mP.set_weights(OW)
mP.compile(loss='mse', optimizer='Adam')
mP.save(path + 'new_model.h5')

If I set training=False on the dropout layers, the model makes predictions identical to the original model. However, with the code written as above, the mean of the probabilistic predictions is not close to the original, deterministic prediction.

Previous models that I developed with dropout active at prediction time had mean probabilistic predictions nearly identical to the deterministic model's. Is there something I am doing incorrectly, or is this an issue with combining LayerNormalization and active dropout? As far as I know, LayerNormalization has trainable parameters, so I didn't know whether active dropout interferes with them. If it does, I am not sure how to remedy this.
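
To rule out a botched transfer itself, one sanity check I can run (assuming both models list their weight arrays in the same order, which they should, since Dropout layers carry no weights) is to compare the arrays directly:

import numpy as np

# Check that every weight array, including each LayerNormalization's
# gamma and beta, survived set_weights() unchanged
for w0, w1 in zip(model0.get_weights(), mP.get_weights()):
    assert w0.shape == w1.shape
    assert np.allclose(w0, w1)
print('All weights transferred intact.')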

This segment of code is for running a quick test and plotting the results:

import numpy as np
import matplotlib.pyplot as plt

# mD is the deterministic model (model0 above); mP has dropout active
inputs = np.zeros(shape=(1, 10), dtype='float32')
inputsP = np.zeros(shape=(1000, 10), dtype='float32')
opD = mD.predict(inputs)[0, 0]
opP = mP.predict(inputsP).reshape(1000)
print('Deterministic: %.4f   Probabilistic: %.4f' % (opD, np.mean(opP)))

plt.scatter(0, opD, color='black', label='Det', zorder=3)
plt.scatter(0, np.mean(opP), color='red', label='Mean prob', zorder=2)
plt.errorbar(0, np.mean(opP), yerr=np.std(opP), color='red', zorder=2,
             markersize=0, capsize=20, label=r'$\sigma$ bounds')
plt.grid(axis='y', zorder=0)
plt.legend()
# Hide the meaningless x axis
plt.tick_params(axis='x', labelsize=0, labelcolor='white', color='white',
                width=0, length=0)
plt.show()

And the resulting output and plot are shown below.

Deterministic: -0.9732 Probabilistic: -0.9011

[Plot: uncertainty results — deterministic prediction (black) vs. mean probabilistic prediction with σ error bars (red)]


1 Answer


Edit to my answer:

I think the problem is simply under-sampling of the model. The standard deviation of the predictions is directly tied to the dropout rate, so the number of predictions needed to approximate the deterministic model grows with it. If you run an extreme version of the test below with the dropout rate set to 0.7 for every dropout layer, 100,000 samples is no longer enough to approximate the deterministic mean to within 10^-3, and the standard deviation of the predictions gets much larger.

import os

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Input

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
GPUs = tf.config.experimental.list_physical_devices('GPU')
for gpu in GPUs:
    tf.config.experimental.set_memory_growth(gpu, True)

# Deterministic model: no dropout at all
inp = Input(shape=(10,))
D1 = Dense(760, activation='softplus')(inp)
D2 = Dense(460, activation='softsign')(D1)
D3 = Dense(664, activation='softsign')(D2)
out = Dense(1, activation='linear')(D3)

mP = Model(inp, out)
mP.compile(loss='mse', optimizer='Adam')

# Probabilistic model: same weights, dropout active on every layer at inference
inp = Input(shape=(10,))
D1 = Dense(760, activation='softplus')(inp)
DO1 = Dropout(0.29)(D1, training=True)
D2 = Dense(460, activation='softsign')(DO1)
DO2 = Dropout(0.16)(D2, training=True)
D3 = Dense(664, activation='softsign')(DO2)
DO3 = Dropout(0.09)(D3, training=True)
out = Dense(1, activation='linear')(DO3)

mP2 = Model(inp, out)
mP2.set_weights(mP.get_weights())
mP2.compile(loss='mse', optimizer='Adam')

# All-zero inputs: every deterministic prediction is identical, so res[0]
# serves as the deterministic reference for the Monte Carlo mean
data = np.zeros(shape=(100000, 10), dtype='float32')
res = mP.predict(data).reshape(data.shape[0])
res2 = mP2.predict(data).reshape(data.shape[0])

print(np.abs(res[0] - res2.mean()))
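
To put a rough number on the sampling requirement: the error of a Monte Carlo mean shrinks like std/sqrt(N), so you can back out how many samples a given spread demands. A back-of-the-envelope sketch (not part of the test above):

import numpy as np

# Standard error of the Monte Carlo mean is roughly std / sqrt(N),
# so hitting a tolerance `tol` needs about N = (std / tol)^2 samples
def samples_needed(std, tol):
    return int(np.ceil((std / tol) ** 2))

for std in (0.18, 0.24):  # spreads observed at the two dropout settings
    print('std=%.2f -> N ~ %d for tol 1e-3' % (std, samples_needed(std, 1e-3)))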
  • I agree with the theory there, but I ran that code multiple times and more often than not, the values were quite different. What version of tensorflow are you using? – WVJoe Feb 13 '21 at 01:49
  • I am using version 1.15.0 for this test, but yes, after looking closer the std is still quite large at 0.18. I can get the mean to approximate the deterministic model to within 10^-3 error by increasing the number of predictions to 100,000, but the std stays around 0.18. If I increase the dropout rate of each dropout layer by 0.1, the mean is still approximated well, but the standard deviation increases to 0.24. It seems you just need to sample from the model more to get the mean approximated well, and the std is directly related to the dropout rate (which makes sense, I think). – Will.Evo Feb 15 '21 at 17:04
  • I edited my answer. Without layer normalization, the main problem with the two models disagreeing is just under-sampling from the probabilistic model given its standard deviation. If you increase the dropout rate, the standard deviation of the probabilistic predictions will increase, and so too will the number of samples needed to approximate the deterministic result. – Will.Evo Feb 15 '21 at 17:13