Which is the most appropriate accuracy metric for multi-label classification when negative and positive values are imbalanced?

Question

I will briefly explain my problem and the approaches I have tested so far.

I have a movie dataset and I am trying to predict 17 genres based on 4 columns (about actors, plot, content, reviews).

My target variable looks like this,

y_train=array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Could it be a problem that they are int32 rather than float32?

y_test=array([[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

As you can see, each row of boolean values may have up to three positive values. My current implementation has the following configuration:

Activation function of output layer: Sigmoid
Loss function: binary_crossentropy
Metric function: Accuracy (binary, since the loss function is binary cross-entropy)

The results were very promising with 0.98 Accuracy level and 0.003 loss on training and validation dataset.

[Figure: learning curves]

No signs of overfitting or underfitting.

However, I suspect that such a high accuracy is due to the large number of negative values: the model predicts the 0s very well, and that alone is enough to achieve a high accuracy.
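The concern above can be checked with a trivial baseline. The following is a minimal sketch in plain Python, using the five y_test rows shown earlier: a "model" that predicts 0 for every label still gets a high element-wise (binary) accuracy while finding none of the positive labels. The helper names binary_accuracy and recall are hypothetical and written only for this illustration.

```python
# y_test copied from the question: 5 movies x 17 genre labels.
y_test = [
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def binary_accuracy(y_true, y_pred):
    """Fraction of individual label entries that match (element-wise)."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    return sum(t == p for t, p in pairs) / len(pairs)


def recall(y_true, y_pred):
    """Fraction of the true positive labels that were predicted as 1."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    positives = sum(t for t, _ in pairs)
    true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)
    return true_positives / positives if positives else 0.0


# Baseline that never predicts any genre.
all_zeros = [[0] * len(row) for row in y_test]

baseline_accuracy = binary_accuracy(y_test, all_zeros)  # 74 of 85 entries are 0
baseline_recall = recall(y_test, all_zeros)             # no positive is found

print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

On these five rows, 74 of the 85 entries are 0, so the baseline is right on about 87% of the entries without learning anything. That is why an accuracy near 0.98 says little on its own.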

So I ran the following trials:

1st trial
Activation function of output layer: Sigmoid
Loss function: categorical_crossentropy
Metric function: categorical_accuracy

The results are much worse. Very high accuracy and a totally unrepresentative validation dataset with many spikes.

2nd trial
Activation function of output layer: Sigmoid
Loss function: sigmoid_focal_loss (link)
Metric function: categorical_accuracy

The loss improved considerably, but the accuracy was still in a poor range of values. So I came to the conclusion that categorical accuracy is not an option for me.
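One possible explanation for the poor categorical accuracy, offered as an assumption rather than a diagnosis of this model: as far as I understand it, categorical accuracy compares only the argmax position of the target row with the argmax position of the prediction row, which assumes exactly one positive label per sample. The sketch below reproduces that comparison in plain Python. The sigmoid scores in y_pred are made up for the illustration.

```python
def argmax(values):
    """Index of the first maximum value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# First row of y_test from the question: three positive genres (4, 7, 12).
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

# Hypothetical sigmoid outputs: all three positives get a high score.
y_pred = [0.05] * 17
y_pred[4], y_pred[7], y_pred[12] = 0.90, 0.95, 0.92

# Argmax-style comparison: the target argmax is 4, the prediction argmax is 7.
categorical_hit = argmax(y_true) == argmax(y_pred)

# Thresholding each label independently at 0.5 recovers the row exactly.
thresholded = [1 if p >= 0.5 else 0 for p in y_pred]

print(categorical_hit, thresholded == y_true)
```

Here the prediction is perfect after thresholding, yet the argmax comparison counts it as a miss. A metric of this kind can therefore be misleading for multi-hot targets.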

3rd trial ( I changed categorical accuracy to AUC)
Activation function of output layer: Sigmoid
Loss function: sigmoid_focal_loss (link)
Metric function: tf.keras.metrics.AUC(multi_label=True)

3rd trial results on test dataset (movies never seen before by the neural network classifier)

"Test Score (evalution of the model's loss/error on the test sequences): 0.026287764310836792"
"Test Accuracy (evalution of the model's auc on the test sequences): 0.99942547082901"

Based on the results of each trial, is it still valid to assume that the model's metric is affected by the imbalance between the 0 and 1 target values? Or is the neural network with the Adam optimizer robust and well generalized? I would appreciate your opinions on this matter.

[UPDATE]

Based on the comments, it was recommended to add class_weights. Passing the following dictionary produced an error:

class_weights={0:1.0, 1:0.29}

Does Keras have any bug with the class weights argument?

Thanks a lot in advance.

[UPDATE] - 11.07.2020

I have decided to follow this plan:

Activation function of output layer: Sigmoid
Loss function: binary_crossentropy
Metric function: f1_score

I don't want to use the Accuracy metric, since it is not appropriate for classification when negative labels greatly outnumber positive labels.

My model.compile() call looks like this:

model_for_pruning.compile(optimizer='adam',
                          loss='binary_crossentropy',
                          metrics=[tfa.metrics.F1Score(y_train[0].shape[-1], average=None)])

However, I have a hard time choosing between the micro-averaged F1 score and the plain (non-averaged) F1 score, since my data are multi-label. My intuition is that micro averaging is more appropriate for multi-label data, but since I use sigmoid and binary_crossentropy, I believe that no averaging should be done in the F1 score. Thus, I tried to put sample weights on my classes.
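To make the difference between the averaging modes concrete, here is a minimal plain-Python sketch on a small made-up example (4 samples, 3 labels; the data are hypothetical and not taken from the movie dataset). It computes the per-label F1 (no averaging), the macro average (unweighted mean of the per-label scores), and the micro average (F1 on the counts pooled over all labels).

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true, y_pred):
    """Return one (tp, fp, fn) tuple per label column."""
    counts = []
    for true_col, pred_col in zip(zip(*y_true), zip(*y_pred)):
        tp = sum(1 for t, p in zip(true_col, pred_col) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(true_col, pred_col) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(true_col, pred_col) if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    return counts


# Hypothetical data: label 0 is frequent, labels 1 and 2 are rare.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
# Hypothetical model that only ever predicts the frequent label.
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

counts = label_counts(y_true, y_pred)

per_label_f1 = [f1_from_counts(*c) for c in counts]   # no averaging
macro_f1 = sum(per_label_f1) / len(per_label_f1)      # mean over labels
micro_f1 = f1_from_counts(*[sum(c) for c in zip(*counts)])  # pooled counts

print(per_label_f1, round(macro_f1, 3), round(micro_f1, 3))
```

In this example the micro average is 0.8 because the frequent label dominates the pooled counts, while the per-label scores and the macro average (about 0.33) expose the two rare labels that are never predicted. Which view is preferable depends on whether rare genres matter as much as frequent ones.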

from sklearn.utils.class_weight import compute_sample_weight

class_weights_sample = compute_sample_weight('balanced',
                                             y_train)

fitted_model = model_for_pruning.fit(
    [X_train_seq_actors, X_train_seq_plot, X_train_seq_features, X_train_seq_reviews],
    y_train,
    steps_per_epoch=int(np.ceil((X_train_seq_actors.shape[0]*optimizer_parameters['validation_split_ratio'])//hparams[HP_HIDDEN_UNITS])),
    epochs=fit_parameters["epoch"],
    batch_size=hparams[HP_HIDDEN_UNITS],
    validation_split=fit_parameters['validation_data_ratio'],
    callbacks=callbacks,
    use_multiprocessing=True,
    sample_weight=class_weights_sample
)
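For reference, the sketch below shows what the sample weights passed above look like on the five y_train rows from the question. It is a hypothetical plain-Python reimplementation based on my reading of scikit-learn's documented 'balanced' heuristic, n_samples / (n_classes * count), with the per-column weights multiplied together for multi-output targets. It is not the library code, and the helper name balanced_sample_weights is made up.

```python
from collections import Counter

# y_train copied from the question: 5 movies x 17 genre labels.
y_train = [
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def balanced_sample_weights(y):
    """Assumed 'balanced' rule: per column, weight(c) = n / (n_classes * count(c));
    the weight of a sample is the product of its per-column weights."""
    n_samples = len(y)
    weights = [1.0] * n_samples
    for column in zip(*y):
        counts = Counter(column)   # how often 0 and 1 occur in this label
        n_classes = len(counts)    # 1 if the label is constant, else 2
        class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
        for i, value in enumerate(column):
            weights[i] *= class_weight[value]
    return weights


sample_weights = balanced_sample_weights(y_train)
for row_index, weight in enumerate(sample_weights):
    print(row_index, round(weight, 4))
```

Under this assumption the result is a single weight per movie (row), not per label. A row with rare genres is up-weighted as a whole, including its many 0 entries, so this is not the same as weighting the positive entries of each label.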

Is this a typical, correct approach, or am I missing something? Please note that I am asking about the validity of the approach and not about whether the code runs, because everything runs successfully.

I would try to calculate the class weights for each of your binary labels. Then I would build a dictionary and pass it as class_weights to the fit function. In this scenario, you should adopt sigmoid as the output activation, binary cross-entropy as the loss, and binary accuracy as the metric. This is, as far as I know, a very classical approach to unbalanced multi-label classification. You could also check label-wise confusion matrices to see the performance for each label. - nsacco, Jul 10 '20 at 10:07
@nsacco thank you for the answer. It would be helpful if you could show how to compute the class weights and apply them to the accuracy. I also want to add that the confusion matrices were computed for the binary accuracy scenario (the initial approach). - NikSp, Jul 10 '20 at 10:18
@nsacco I believe the straightforward way is to use a dictionary like {0: 1., 1: 15} and feed it to the class_weights parameter of model.fit(). Is there a more robust way to approximate those weights, though? - NikSp, Jul 10 '20 at 10:25
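One way to approximate per-label weights, sketched here as an assumption and not as the method the commenter had in mind: give each label its own positive weight equal to the ratio of negatives to positives in the training targets, and apply it inside a weighted binary cross-entropy. The helpers positive_weights and weighted_bce are hypothetical plain-Python functions, and the uniform 0.1 prediction is made up.

```python
import math

# y_train copied from the question: 5 movies x 17 genre labels.
y_train = [
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def positive_weights(y):
    """Per-label weight for the positive class: n_negative / n_positive.
    Labels without any positive example fall back to 1.0 (an assumption)."""
    n_samples = len(y)
    weights = []
    for column in zip(*y):
        n_pos = sum(column)
        weights.append((n_samples - n_pos) / n_pos if n_pos else 1.0)
    return weights


def weighted_bce(y_true_row, y_prob_row, pos_w, eps=1e-7):
    """Mean over labels of -(w * t * log(p) + (1 - t) * log(1 - p))."""
    total = 0.0
    for t, p, w in zip(y_true_row, y_prob_row, pos_w):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= w * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true_row)


pos_w = positive_weights(y_train)

# Hypothetical prediction that scores every genre at 0.1 for the last movie.
row = y_train[4]
probs = [0.1] * len(row)
loss_weighted = weighted_bce(row, probs, pos_w)
loss_unweighted = weighted_bce(row, probs, [1.0] * len(row))

print(round(loss_unweighted, 4), round(loss_weighted, 4))
```

With these weights, missing a rare positive label costs more than missing a frequent one, while the negative entries keep weight 1. In a real training setup the same idea would have to be expressed as a custom loss in the framework being used.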
