I am new to Batch Normalization and, after some self-study, I am trying to implement it with the help of ChatGPT. Since this is for a CNN, the input is a 4-D array with shape (batch size, height, width, channels). I initialize the scaling factor G and the shifting factor B as numpy arrays of shape (1, 1, 1, channels). However, during backpropagation the derivatives with respect to G and B both come out with shape (batch size, height, width, channels), and when I apply the update, broadcasting turns G and B into arrays of that shape as well. Where am I going wrong?
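To illustrate what I mean, here is a minimal shape check with made-up sizes (these are not my real layer dimensions):

import numpy as np

A = np.random.randn(20, 12, 12, 8)   # activations: (batch size, height, width, channels)
G = np.ones((1, 1, 1, 8))            # scaling factor, one value per channel
dP = np.random.randn(20, 12, 12, 8)  # upstream gradient, same shape as the activations

dG = dP * A                          # shape (20, 12, 12, 8) instead of (1, 1, 1, 8)
G = G - 1e-3 * dG                    # broadcasting blows G up to (20, 12, 12, 8)
print(G.shape)                       # (20, 12, 12, 8)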
I also looked at this answer regarding Batch Normalization, and the way I standardize the activations seems to differ from it. Could you please check my implementation and help me find my mistake?
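If I understand that answer correctly, it standardizes with epsilon added once inside the square root, roughly like this (my paraphrase, not its exact code):

mean = np.mean(A1, axis=(0, 1, 2), keepdims=True)
var = np.var(A1, axis=(0, 1, 2), keepdims=True)
M1 = (A1 - mean) / np.sqrt(var + epsilon)

whereas I add epsilon to the variance and then, for the first layer, add it again to the standard deviation.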
Here is my code:
epsilon = 1e-5
for i in range(5):
    Z1 = Conv1.convolve(example)
    A1 = NN.activation("relu", Z1)

    # batch norm for the first conv layer
    mean1 = np.mean(A1, axis=(0, 1, 2), keepdims=True)
    var1 = np.var(A1, axis=(0, 1, 2), keepdims=True) + epsilon
    std1 = np.sqrt(var1)
    M1 = (A1 - mean1) / (std1 + epsilon)
    N1 = M1 * G1 + B1

    if i == 0:
        Conv2.initialize(N1)

    Z2 = Conv2.convolve(N1)
    A2 = NN.activation("relu", Z2)

    # batch norm for the second conv layer
    mean2 = np.mean(A2, axis=(0, 1, 2), keepdims=True)  # Shape: (1, 1, 1, num_channels)
    var2 = np.var(A2, axis=(0, 1, 2), keepdims=True) + epsilon
    std2 = np.sqrt(var2)  # Shape: (1, 1, 1, num_channels)
    M2 = (A2 - mean2) / std2
    N2 = M2 * G2 + B2

    # flatten for the fully connected part
    N3 = N2.reshape(N2.shape[0], -1).T
    pred = NN.forward_propagation(N3, "v", return_=True)

    # BACKPROPAGATION
    dZ = NN.back_propagtaion(activation="Sigmoid", activation_prev=NN.parameter["A0"],
                             m=20, layer=1, Y_true=yy, activation_cur=NN.parameter["A1"])
    dNN = np.dot(dZ, NN.Weight[1])
    dP = dNN.reshape(N2.shape)

    # Derivative of scaling and shifting factor (layer 2)
    dG2 = dP * M2
    dB2 = np.sum(dP, axis=(0, 1, 2))
    dZ2 = dP * G2
    m = A2.shape[0]
    dsigma2 = dZ2 * (A2 - mean2) * (-1 / (std2 ** 2))
    dmu2 = -np.sum(dZ2 * G2 / std2, axis=(0, 1, 2)) + dsigma2 * np.sum(-2 * (A2 - mean2), axis=(0, 1, 2)) / m
    dback2 = dZ2 * G2 / std2 + dsigma2 * 2 * (A2 - mean2) / m + dmu2 / m
    d_act2 = NN.activation_derivative("relu", Z2)
    dZ2 = dback2 * d_act2
    dK2, dW2, db2 = Conv2.backprop(dZ2)

    # Derivative of scaling and shifting factor (layer 1)
    dG1 = dK2 * M1
    dB1 = np.sum(dK2, axis=(0, 1, 2))
    dZ1 = dK2 * G1
    dsigma1 = dZ1 * (A1 - mean1) * (-1 / (std1 ** 2))
    dmu1 = -np.sum(dZ1 * G1 / std1, axis=(0, 1, 2)) + dsigma1 * np.sum(-2 * (A1 - mean1), axis=(0, 1, 2)) / m
    dback1 = dZ1 * G1 / std1 + dsigma1 * 2 * (A1 - mean1) / m + dmu1 / m
    d_act1 = NN.activation_derivative("relu", Z1)
    dZ1 = dback1 * d_act1
    dK1, dW1, db1 = Conv1.backprop(dZ1)

    # parameter updates
    lr = 1e-3
    Conv2.K -= lr * dW2
    Conv2.b -= lr * db2
    Conv1.K -= lr * dW1
    Conv1.b -= lr * db1
    G2 -= lr * dG2
    B2 -= lr * dB2
    G1 -= lr * dG1
    B1 -= lr * dB1