I have a dataset that looks like this:
df.head(5)
                                              data  labels
0  [0.0009808844009380855, 0.0008974465127279559]       1
1  [0.0007158940267629654, 0.0008202958833774329]       3
2  [0.00040971929722210984, 0.000393972522972382]       3
3  [7.916243163372941e-05, 7.401835468434177e-05]       3
4  [8.447556379936086e-05, 8.600626393842705e-05]       3
The 'data' column is my X and the 'labels' column is my y. The df has 34890 rows, and each entry in 'data' is a list of 2 floats. The data represents sequential text: each observation is a vector representation of one sentence. There are 5 classes.
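For clarity, each cell in 'data' is a Python list, so the column itself is object-dtype; a quick check (assuming the frame shown above):

print(type(df['data'].iloc[0]), len(df['data'].iloc[0]))  # <class 'list'> 2
print(df['labels'].nunique())                             # 5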
I am training the following LSTM on it:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense

# df['data'] holds Python lists, so stack it into a proper (34890, 2) float array
data = np.stack(df['data'].to_numpy())
labels = pd.get_dummies(df['labels']).values  # one-hot encode the 5 classes
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.10, random_state=42)
# add a timesteps axis so the LSTM sees (samples, timesteps, features)
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))  # shape = (31401, 1, 2)
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))      # shape = (3489, 1, 2)
### y_train shape = (31401, 5)
### y_test shape = (3489, 5)
### Bi_LSTM
Bi_LSTM = Sequential()
Bi_LSTM.add(layers.Bidirectional(layers.LSTM(32)))
Bi_LSTM.add(layers.Dropout(.5))
# Bi_LSTM.add(layers.Flatten())
Bi_LSTM.add(Dense(5, activation='softmax'))  # one output unit per class (5 classes)
def compile_and_fit(model):
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(X_train,
                        y_train,
                        epochs=30,
                        batch_size=32,
                        validation_data=(X_test, y_test))
    return history
LSTM_history = compile_and_fit(Bi_LSTM)
The model trains, but the validation accuracy is stuck at 53% on every epoch. I am assuming this is my class-imbalance problem: one class makes up ~53% of the data, while the other four are spread fairly evenly across the remaining ~47%, so the model appears to just be predicting the majority class.
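A quick sanity check of the label distribution bears this out (the proportions in the comment are approximate):

df['labels'].value_counts(normalize=True)
# one class sits at ~0.53; the other four share the remaining ~0.47 fairly evenly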
How do I balance my data? I'm aware of the usual over-/under-sampling techniques for non-time-series data, but I can't over- or under-sample here because that would break the sequential, time-series nature of the data. Any advice?
EDIT
I am attempting to use the class_weight argument in Keras to address this, passing it this dict:
class_weights = {
    0: 1 / len(df[df.labels == 1]),
    1: 1 / len(df[df.labels == 2]),
    2: 1 / len(df[df.labels == 3]),
    3: 1 / len(df[df.labels == 4]),
    4: 1 / len(df[df.labels == 5]),
}
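The dict then goes straight into the fit call; a minimal sketch (everything else unchanged from above):

Bi_LSTM.fit(X_train, y_train,
            epochs=30,
            batch_size=32,
            validation_data=(X_test, y_test),
            class_weight=class_weights)  # keys 0-4 match the one-hot column order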
This reciprocal-count weighting is based on this recommendation:
However, the accuracy/loss is now really awful. I get ~30% accuracy with a dense net, so I expected the LSTM to be an improvement. See the accuracy/loss curves below: