I have a dataset of aerial images of a vineyard that I'd like to sub-classify. There are 3 main classes [ground, vegetal, vineyard], and some of them are divided into subclasses: vegetal: [grass, flower], vineyard: [healthy, disease A, disease B]. I'd like to design a neural network with three outputs. The problem is that for some classes or subclasses (e.g. grass) some outputs are irrelevant (e.g. the disease output). This is not a problem for inference, but for training I'd like to find the right (multi-)labels so that, for instance, when computing the error on a grass sample, the weights that are responsible only for the disease output are not updated.
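For reference, here is roughly the three-output architecture I have in mind (the backbone here is just a placeholder, and the layer names are arbitrary):

import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(128, 128, 3))
# shared feature extractor (placeholder backbone)
x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu')(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
# one head per question: main class, vegetal subclass, vineyard subclass
main_out = tf.keras.layers.Dense(3, activation='softmax', name='main')(x)        # ground / vegetal / vineyard
vegetal_out = tf.keras.layers.Dense(2, activation='softmax', name='vegetal')(x)  # grass / flower
vineyard_out = tf.keras.layers.Dense(3, activation='softmax', name='vineyard')(x)  # healthy / disease A / disease B
model = tf.keras.Model(inputs=inputs, outputs=[main_out, vegetal_out, vineyard_out])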
My idea was to give each sample 3 one-hot vectors, each of which is all-zero whenever the corresponding output is irrelevant. For instance, a healthy vineyard sample would be encoded as [[0,0,1],[0,0],[1,0,0]] and a ground sample as [[1,0,0],[0,0],[0,0,0]]. I thought this would work because, for a ground sample, the gradient of the error on the vegetal/disease outputs should then be zero.
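Concretely, the two examples above would be encoded like this (one array per output head):

import numpy as np

# one label vector per output: [main class, vegetal subclass, vineyard subclass]
healthy_vineyard = [np.array([0, 0, 1]),   # main: vineyard
                    np.array([0, 0]),      # vegetal subclass: irrelevant -> all zeros
                    np.array([1, 0, 0])]   # vineyard subclass: healthy
ground = [np.array([1, 0, 0]),  # main: ground
          np.array([0, 0]),     # vegetal subclass: irrelevant
          np.array([0, 0, 0])]  # vineyard subclass: irrelevant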
However, it looks like the gradient of the error computed with these all-zero "one-hot" labels is actually non-zero. Here is a minimal example:
import tensorflow as tf
import numpy as np

# small conv net with a single 3-class softmax head
input_layer = tf.keras.layers.Input(shape=(128, 128, 3))
x = tf.keras.layers.Conv2D(8, kernel_size=(2, 2), activation='relu')(input_layer)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)
x = tf.keras.layers.Conv2D(16, kernel_size=(2, 2), activation='relu')(x)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)
x = tf.keras.layers.Conv2D(32, kernel_size=(2, 2), activation='relu')(x)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)
x = tf.keras.layers.Conv2D(32, kernel_size=(2, 2), activation='relu')(x)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)
x = tf.keras.layers.Conv2D(32, kernel_size=(2, 2), activation='relu')(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
output = tf.keras.layers.Dense(3, activation='softmax')(x)
model = tf.keras.Model(inputs=input_layer, outputs=output)
model.compile('sgd', 'binary_crossentropy')

# one random image with an all-zero label
X = np.random.randint(0, 255, (1, 128, 128, 3)).astype('float32')
Y = np.zeros((1, 3))

# compare the output layer's weights before and after one training step
old_weights = model.layers[-1].get_weights()
model.train_on_batch(X, Y)
new_weights = model.layers[-1].get_weights()
print(old_weights[0] - new_weights[0])
This prints a non-zero tensor, so the all-zero label does update the output layer's weights.
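To double-check, I also computed the gradient directly with tf.GradientTape (assuming TF 2.x / eager execution; the dense layer below just stands in for the model's output head), and it is likewise non-zero:

import tensorflow as tf

# a softmax head standing in for the model's last layer
dense = tf.keras.layers.Dense(3, activation='softmax')
features = tf.random.normal((1, 32))
labels = tf.zeros((1, 3))  # the all-zero "one-hot" label

with tf.GradientTape() as tape:
    probs = dense(features)
    loss = tf.keras.losses.binary_crossentropy(labels, probs)

# gradient with respect to the kernel: not a zero tensor
print(tape.gradient(loss, dense.trainable_variables)[0])

Can anyone explain what is happening here?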