
I wanted to know about optimal ways to calculate class weights for large datasets built with the tf.data API. Consider the official TensorFlow tutorial on handling imbalanced datasets: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data.

If I were to scale this approach to a moderately sized image dataset, what should my strategy be for calculating the class weights, the initial bias, and so on?

S. P

1 Answer


If by class weights you mean the dictionary passed to model.fit, the code below will return that dictionary.

import numpy as np

def class_weight_calc(class_id_list, class_freq_list):
    # Weight each class by the inverse of its frequency,
    # then normalize so the most frequent class has weight 1.0.
    class_weight = {}
    total = sum(class_freq_list)
    smallest = np.inf
    for klass, count in zip(class_id_list, class_freq_list):
        class_weight[klass] = total / count
        if class_weight[klass] < smallest:
            smallest = class_weight[klass]
    for c in class_id_list:
        class_weight[c] = class_weight[c] / smallest
    return class_weight

Note that class_id_list is a list of your class indices, and class_freq_list is a corresponding list of how many samples there are for each class. For example, if you have 3 classes then class_id_list=[0, 1, 2]. If there are 10 samples for class 0, 20 samples for class 1, and 40 samples for class 2, then class_freq_list=[10, 20, 40]. With these values the function returns class_weight={0: 4.0, 1: 2.0, 2: 1.0}.
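Here is that worked example as code. The model.fit call is commented out because model and train_ds are placeholders for your own compiled Keras model and training dataset, which are not defined here:

class_weight = class_weight_calc([0, 1, 2], [10, 20, 40])
print(class_weight)  # {0: 4.0, 1: 2.0, 2: 1.0}

# Pass the dictionary to model.fit, e.g.:
# model.fit(train_ds, epochs=10, class_weight=class_weight)

Since your dataset is built with the tf.data API, one way to obtain class_id_list and class_freq_list without loading everything into memory is to stream over the labels once. This is a minimal sketch that assumes the dataset yields (image, label) elements where the labels are integer class indices; train_ds is again a placeholder:

import collections

def count_labels(dataset):
    # Stream over the dataset once and tally how many samples
    # fall into each class. Assumes integer class-index labels.
    counts = collections.Counter()
    for _, labels in dataset:
        counts.update(labels.numpy().ravel().tolist())
    class_ids = sorted(counts)
    return class_ids, [counts[c] for c in class_ids]

class_ids, class_freqs = count_labels(train_ds)
class_weight = class_weight_calc(class_ids, class_freqs)

This costs one full pass over the data, but it only materializes the label tensors, so it stays cheap even for image datasets that do not fit in memory.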

Gerry P