
I have a dataset that looks like this:

df.head(5)


 data                                                     labels
0  [0.0009808844009380855, 0.0008974465127279559]             1
1  [0.0007158940267629654, 0.0008202958833774329]             3
2  [0.00040971929722210984, 0.000393972522972382]             3
3  [7.916243163372941e-05, 7.401835468434177e-05]             3
4  [8.447556379936086e-05, 8.600626393842705e-05]             3

The 'data' column is my X and the 'labels' column is my y. The df has 34890 rows, and each row's 'data' entry contains 2 floats. The data represents sequential text: each observation is a vector representation of a sentence. There are 5 classes.

I am training it on this LSTM code:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense

data = np.stack(df['data'].values)            # lists of 2 floats -> (34890, 2) array
labels = pd.get_dummies(df['labels']).values  # one-hot encoding, shape (34890, 5)

X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.10, random_state=42)

# Add a timestep axis: (samples, timesteps, features)
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))  # shape = (31401, 1, 2)
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))      # shape = (3489, 1, 2)
### y_train shape = (31401, 5)
### y_test shape = (3489, 5)

### Bi_LSTM
Bi_LSTM = Sequential()
Bi_LSTM.add(layers.Bidirectional(layers.LSTM(32)))
Bi_LSTM.add(layers.Dropout(0.5))
Bi_LSTM.add(Dense(5, activation='softmax'))  # 5 output units, one per class

def compile_and_fit(model):

    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    history = model.fit(X_train,
                        y_train,
                        epochs=30,
                        batch_size=32,
                        validation_data=(X_test, y_test))

    return history

LSTM_history = compile_and_fit(Bi_LSTM)

The model trains, but the validation accuracy is stuck at 53% for every epoch. I assume this is because of my class imbalance problem (one class makes up ~53% of the data, while the other four are spread fairly evenly across the remaining 47%).

How do I balance my data? I am aware of the typical over/under-sampling techniques for non-time-series data, but I can't over/under-sample here because that would disrupt the sequential, time-series nature of the data. Any advice?

EDIT

I am attempting to use the class_weight argument in Keras to address this, passing in the following dict:

class_weights = {
    0: 1 / len(df[df.labels == 1]),
    1: 1 / len(df[df.labels == 2]),
    2: 1 / len(df[df.labels == 3]),
    3: 1 / len(df[df.labels == 4]),
    4: 1 / len(df[df.labels == 5]),
}

I am basing this off of this recommendation:

https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes

However, the acc/loss is now really awful. I get ~30% accuracy with a dense net, so I expected the LSTM to be an improvement. See acc/loss curves below:

[accuracy/loss curves after adding class weights]
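
For reference, a normalized way to compute the same weights, which I have not tried yet (a sketch assuming scikit-learn's compute_class_weight; the raw 1/count values above are all tiny, whereas 'balanced' keeps the same ratios at a more sensible scale):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# df['labels'] runs 1..5, but pd.get_dummies puts those classes in one-hot
# columns 0..4, so shift to 0-based indices before computing the weights
y_int = df['labels'].values - 1

# 'balanced' = n_samples / (n_classes * count_per_class): the same relative
# weighting as 1/count, but scaled so the weights average out to ~1
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_int),
                               y=y_int)
class_weights = dict(enumerate(weights))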

connor449
  • Try assigning a weight to each class/label. You can also try focal loss. – Pygirl May 05 '20 at 18:30
  • Thank you, but I don't understand what you mean. Can you elaborate on both of your suggestions? – connor449 May 05 '20 at 18:34
  • Cells 15 and 19 -> https://www.kaggle.com/hirayukis/lightgbm-keras-and-4-kfold – Pygirl May 05 '20 at 18:36
  • https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes – Pygirl May 05 '20 at 18:37
  • @Pygirl thanks for the link, that was helpful. I have tried to implement this, but the training is very bad. I supplied the code of my implementation. Is this how you would do it? – connor449 May 05 '20 at 19:43
  • You need more features and a better architecture. Also, if possible, try adam instead of rmsprop. – Pygirl May 06 '20 at 02:51
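
For reference, the focal loss mentioned in the comments could be implemented for one-hot targets along these lines (a sketch, not code from the linked notebook; gamma and alpha are the usual tunable focusing parameters):

import tensorflow as tf

def categorical_focal_loss(gamma=2.0, alpha=0.25):
    # Focal loss down-weights easy, well-classified examples so training
    # focuses on the hard (often minority-class) ones
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        cross_entropy = -y_true * tf.math.log(y_pred)
        modulator = alpha * tf.pow(1.0 - y_pred, gamma)
        return tf.reduce_sum(modulator * cross_entropy, axis=-1)
    return loss

# Drop-in replacement for 'categorical_crossentropy', e.g.:
# Bi_LSTM.compile(optimizer='adam', loss=categorical_focal_loss(), metrics=['accuracy'])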

1 Answer


Keras/TensorFlow lets you pass either class_weight or sample_weight to the model.fit method.

class_weight affects the relative weight of each class in the calculation of the objective function. sample_weight, as the name suggests, allows further control: it sets the relative weight of individual samples, even ones that belong to the same class.

class_weight accepts a dictionary in which you assign a weight to each class, while sample_weight takes a one-dimensional array of length len(y_train) in which you assign a specific weight to each sample.
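
A minimal sketch of both options (the weight values here are hypothetical, and model stands in for your compiled network; use one of the two fit calls, not both):

import numpy as np

# y_train is one-hot with shape (n_samples, 5); recover integer class indices
y_int = y_train.argmax(axis=1)

# Option 1: one weight per class (illustrative values only)
class_weights = {0: 1.0, 1: 8.0, 2: 8.0, 3: 8.0, 4: 8.0}
model.fit(X_train, y_train, epochs=30, batch_size=32,
          class_weight=class_weights,
          validation_data=(X_test, y_test))

# Option 2: the same weighting expanded to one entry per training sample
sample_weights = np.array([class_weights[c] for c in y_int])
model.fit(X_train, y_train, epochs=30, batch_size=32,
          sample_weight=sample_weights,
          validation_data=(X_test, y_test))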

Marco Cerliani