
I have a dataset df for a multiclass classification problem with a huge class imbalance; the classes grade_F and grade_G in particular are severely underrepresented.

>>> percentage = df['grade'].value_counts(normalize=True)
>>> print(percentage)

B    0.295436
C    0.295362
A    0.204064
D    0.136386
E    0.048788
F    0.014684
G    0.005279

At the same time, my predictions for the less represented classes are very inaccurate.

I have a neural network with an output dimension of 7; the target array I want to predict looks like this:

>>> print(y_train.head())
        grade_A  grade_B  grade_C  grade_D  grade_E  grade_F  grade_G
689526        0        1        0        0        0        0        0
523913        1        0        0        0        0        0        0
266122        0        0        1        0        0        0        0
362552        0        0        0        1        0        0        0
484987        1        0        0        0        0        0        0
...

So I tried the following neural network:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm

def create_model(input_dim, output_dim):
    print(output_dim)
    # create model
    model = Sequential()
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))

    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))

    # output layer
    model.add(Dense(output_dim, activation='softmax'))

    # Compile model
    model.compile(loss='categorical_crossentropy', loss_weights=lossWeights, optimizer='adam', metrics=['accuracy'])
    return model

from keras.callbacks import ModelCheckpoint
from keras.models import load_model

model = create_model(x_train.shape[1], y_train.shape[1])

epochs = 35
batch_sz = 64

print("Beginning model training with batch size {} and {} epochs".format(batch_sz, epochs))

checkpoint = ModelCheckpoint("lc_model.h5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto', period=1)
# train the model
history = model.fit(x_train.as_matrix(),
                y_train.as_matrix(),
                validation_split=0.2,
                epochs=epochs,  
                batch_size=batch_sz, # Can I tweak the batch here to get evenly distributed data ?
                verbose=2,
                callbacks=[checkpoint])

# revert to the best model encountered during training
model = load_model("lc_model.h5")

The lossWeights I passed to compile was a vector of weights inversely proportional to class frequency, defined beforehand as:

lossWeights = 1. / df['grade'].value_counts(normalize=True)
lossWeights = lossWeights.sort_index().tolist()

However, compiling failed with an error saying the model has only one output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-bf262c22c9dc> in <module>
      2 from keras.models import load_model
      3 
----> 4 model = create_model(x_train.shape[1], y_train.shape[1])
      5 
      6 epochs =  35

<ipython-input-65-9290b177bace> in create_model(input_dim, output_dim)
     19 
     20     # Compile model
---> 21     model.compile(loss='categorical_crossentropy', loss_weights=lossWeights, optimizer='adam', metrics=['accuracy'])
     22     return model

C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py in compile(self, optimizer, loss, metrics, loss_weights, sample_weight_mode, weighted_metrics, target_tensors, **kwargs)
    178                                  'The model has ' + str(len(self.outputs)) +
    179                                  ' outputs, but you passed loss_weights=' +
--> 180                                  str(loss_weights))
    181             loss_weights_list = loss_weights
    182         else:

ValueError: When passing a list as loss_weights, it should have one entry per model output. The model has 1 outputs, but you passed loss_weights=[4.9004224502112255, 3.3848266392035704, 3.385677583130476, 7.33212052000478, 20.49667767920116, 68.10064134188455, 189.42024013722127]
– Revolucion for Monica

2 Answers


loss_weights does not weight different classes; it weights different outputs. Your model has only one output. Yes, that output is a vector of seven values, but Keras still treats it as a single output.

A model built with the functional API can have multiple outputs, each with its own loss function. During training, the total loss is then the weighted sum of each loss applied to its respective output, and loss_weights sets those weights.
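
To illustrate, here is a minimal sketch of the kind of multi-output functional-API model that loss_weights is designed for (the layer names, shapes, and weight values here are hypothetical):

from keras.models import Model
from keras.layers import Input, Dense

inputs = Input(shape=(20,))
hidden = Dense(64, activation='relu')(inputs)
# two separate outputs, each with its own loss function
out_a = Dense(7, activation='softmax', name='out_a')(hidden)
out_b = Dense(1, activation='sigmoid', name='out_b')(hidden)

model = Model(inputs=inputs, outputs=[out_a, out_b])
# the total training loss is 1.0 * loss(out_a) + 0.5 * loss(out_b)
model.compile(optimizer='adam',
              loss={'out_a': 'categorical_crossentropy',
                    'out_b': 'binary_crossentropy'},
              loss_weights={'out_a': 1.0, 'out_b': 0.5})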

However, I do not believe it is useful for what you want to do.

– The Guy with The Hat

What you are looking for is class_weight in the fit function.

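# class indices follow the one-hot column order: 0 = grade_A, ..., 6 = grade_G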
weights = {0: 1 / 0.204064,
           1: 1 / 0.295436, 
           2: 1 / 0.295362,
           3: 1 / 0.136386, 
           4: 1 / 0.048788,
           5: 1 / 0.014684,
           6: 1 / 0.005279}
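
For reference, the same dictionary can be built directly from the class frequencies instead of hardcoding the numbers (a sketch, assuming df['grade'] holds the string labels and the one-hot columns are in sorted order A through G):

freqs = df['grade'].value_counts(normalize=True).sort_index()
weights = {i: 1. / f for i, f in enumerate(freqs)}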

You may want to scale these down, since the weights range from about 3 to 200, but their absolute size matters less than their ratios.
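
For example, dividing every weight by the mean keeps the ratios intact while bringing the values close to 1 (a sketch):

mean_w = sum(weights.values()) / len(weights)
weights = {k: v / mean_w for k, v in weights.items()}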

Then:

model.fit(....
          class_weight = weights, 
         )
– Daniel Möller
  • Thanks! Yes, that's something I was considering while trying to work on the features. But how should I reduce their sizes properly? And what did you mean by saying that the relation is more important? – Revolucion for Monica Sep 17 '19 at 19:28
  • It's important that `weight6 / weight0 = 0.204064 / 0.005279`, but the absolute values could be anything. You can divide all of them by 3, by 10, etc., no problem. If they are too big, they effectively increase your learning rate, which is why it may be worth reducing them a bit; but testing, or simply adjusting the learning rate, are both good ideas. – Daniel Möller Sep 18 '19 at 13:06