
When I download the dataset for my ResNet model, the data files show 33 classes for training and 6 classes for validation. But when I fit the model, it reports that the number of classes is wrong. Here is the code:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.optimizers import Adam

resnet_model = Sequential()
pretrained_model = tf.keras.applications.ResNet50(include_top=False,
                                                  input_shape=(224, 224, 3),
                                                  pooling='avg',
                                                  classes=33,  # ignored when include_top=False
                                                  weights='imagenet')

# freeze the pretrained backbone
for layer in pretrained_model.layers:
    layer.trainable = False

resnet_model.add(pretrained_model)
resnet_model.add(Flatten())
# resnet_model.add(Dense(512, activation='relu'))
resnet_model.add(Dense(33, activation='softmax'))
resnet_model.compile(optimizer=Adam(learning_rate=0.001),
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])

epochs = 3
history = resnet_model.fit(
    trains_ds,
    validation_data=val_ds,
    epochs=epochs)

The error shows: Shapes (None, 33) and (None, 6) are incompatible

Do I have to have the same number of classes in the training dataset and the validation dataset? If I have 33 classes for training and only 6 for validation, do I need to add pictures for the other 27 classes to the validation dataset before I can fit the model? Is that right?

Calla
You already asked this question yesterday, and it was closed. Only ask questions once, and do not ask off-topic questions (this is not a programming problem). – Dr. Snoopy Aug 03 '22 at 08:32

1 Answer


Yes, you have to have the same "number of classes", or more precisely, the one-hot encoding of your validation labels has to have the same shape as that of your training labels: in this case, (None, 33).

To understand why, let's briefly digress on one-hot encoding. It is a process by which categorical variables (your labels 0, 1, 2, ..., 32) are converted into a form better suited to ML algorithms.

Example: for simplicity, assume you only have 4 classes, so your dataset has the labels 0, 1, 2, 3. One-hot encoding creates, for each label, a vector of length 4 that is all zeros except for a single 1, whose position encodes the label:

0 -> [1 0 0 0]
1 -> [0 1 0 0]
2 -> [0 0 1 0]
3 -> [0 0 0 1]
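
For reference, the same mapping can be reproduced with tf.keras.utils.to_categorical (a minimal sketch, assuming plain integer labels as input):

import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 1, 2, 3])            # four integer class labels
one_hot = to_categorical(labels, num_classes=4)
print(one_hot)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]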

Now back to your case. Your network rightly raises an error because it simply cannot handle targets of different shapes. There is also a logical reason why this is not allowed: ambiguity. If we one-hot encode a label, say label 2, over 33 classes and over 6 classes, the resulting vectors are different. How could the network match the two vectors and figure out that they refer to the same class?
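
To make the clash concrete, a quick sketch (assuming the labels are one-hot encoded, e.g. via to_categorical or a loader with label_mode='categorical'):

from tensorflow.keras.utils import to_categorical

# the same label encoded against different class counts has different shapes
print(to_categorical(2, num_classes=33).shape)  # (33,)
print(to_categorical(2, num_classes=6).shape)   # (6,)
# Dense(33, activation='softmax') outputs (None, 33), so (None, 6)
# validation targets cannot be compared against it.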

There is another aspect to consider: why would you want to validate your training on a reduced set of classes? Validation is meant to monitor the performance of your model on a diverse set of images that do not affect the training. If you use a narrower set of classes than the training set, you end up validating only that subset and completely neglecting the others.

So, to answer the question: you could keep the images you currently have for validation and modify their one-hot encoding so that it has the same shape as the training labels (a sketch follows below), but that would not make much sense. The recommended approach is to include validation images from all classes.
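
For completeness, here is a minimal sketch of that workaround. It assumes val_ds yields (images, labels) batches with 6-wide one-hot labels and, crucially, that the 6 validation classes correspond to training class indices 0 to 5; if they do not, you would have to remap the label indices rather than just pad them.

import tensorflow as tf

# hypothetical workaround: pad the 6-wide one-hot validation labels to 33 columns
def pad_labels(images, labels):
    # labels: (batch, 6) -> (batch, 33) by appending 27 zero columns
    return images, tf.pad(labels, paddings=[[0, 0], [0, 33 - 6]])

val_ds_padded = val_ds.map(pad_labels)

Even then, the remaining 27 classes are never validated, so adding validation images for every class is the better fix.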

ClaudiaR