(Problem: Overfitting issues in a multiclass text classification problem)
In my personal project, the objective is to classify the industry tags of a company based on the company description. The steps I've taken are:
- Removed stopwords, punctuation, extra whitespace, etc., and split each description into tokens.
- Converted the labels and tokens into word vectors.
- Converted the tokens into a word embedding model.
- Set up a CNN with 62 output nodes (one for each of the 62 distinct industry tags).
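For reference, the preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the exact code from my notebook: the stopword list, the sample descriptions, the label names, and the helper names (`tokenize`, `build_vocab`, `encode`, `one_hot`) are all hypothetical, and I am assuming the labels end up one-hot encoded because the model uses `categorical_crossentropy`.

```python
import string

# Hypothetical, tiny stopword list for illustration only; a real project
# would likely use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "the", "and", "of", "for", "in", "to", "is", "we"}


def tokenize(description):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    words = description.lower().translate(table).split()
    return [w for w in words if w not in STOPWORDS]


def build_vocab(token_lists):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def encode(tokens, vocab, maxlen):
    """Turn tokens into ids, then truncate or right-pad with 0 to maxlen."""
    ids = [vocab[t] for t in tokens if t in vocab][:maxlen]
    return ids + [0] * (maxlen - len(ids))


def one_hot(label, labels):
    """One-hot encode a single label against the ordered list of all labels."""
    return [1 if label == candidate else 0 for candidate in labels]


# Made-up sample data, not taken from the real dataset.
descriptions = [
    "We build cloud software for banks.",
    "Organic farming and food delivery!",
]
token_lists = [tokenize(d) for d in descriptions]
vocab = build_vocab(token_lists)
X = [encode(tokens, vocab, maxlen=6) for tokens in token_lists]
y = [one_hot("Finance", ["Agriculture", "Finance"])]
```

The padding side (left or right) is a choice here; this sketch pads on the right.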
Image/Dataset Link for reference: https://drive.google.com/drive/folders/1yLW2YepoHvSp_koHDDzcAAJBIaYQIen0?usp=sharing
The issue I face is that the model overfits regardless of the changes I make. (Training ends early because of the EarlyStopping callback I set up on the validation loss.) [CNN accuracy][7]
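To make the early-stopping behaviour concrete, here is a simplified sketch of the patience logic as I understand it: training stops once the monitored validation loss has failed to improve for `patience` consecutive epochs. It is not Keras's actual implementation, the function name `early_stopping_epoch` is hypothetical, and the loss values are made up for illustration (they are not from my run).

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified model: stop after `patience` consecutive epochs in which
    the validation loss did not drop below the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: improving until epoch 3, then getting worse,
# which is the typical shape when a model starts to overfit.
example_losses = [2.9, 2.5, 2.3, 2.4, 2.5, 2.6, 2.8, 3.0, 3.1]
stop_epoch = early_stopping_epoch(example_losses, patience=5)
print(stop_epoch)  # 8
```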
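    # Imports (assuming the tf.keras API)
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Embedding, Dropout, Conv1D,
                                         GlobalMaxPooling1D, Dense, Activation)
    from tensorflow.keras.callbacks import EarlyStopping

    # Hyperparameters. vocab_size, embedding_matrix, X_train, X_test,
    # y_label_train and y_label_test are defined elsewhere (see the code link).
    max_features = 700
    maxlen = 200
    embedding_dims = 50  # not used below; the Embedding layer is hard-coded to 100
    filters = 200
    kernel_size = 3
    hidden_dims = 160

    # Stop training once val_loss has not improved for 5 consecutive epochs
    es_callback = EarlyStopping(monitor='val_loss', patience=5)

    model = Sequential()
    # Frozen embedding layer initialised from embedding_matrix
    model.add(Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False))
    model.add(Dropout(0.4))
    model.add(Conv1D(filters,
                     kernel_size,
                     padding='valid',
                     activation='relu',
                     strides=1))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(hidden_dims))
    model.add(Dropout(0.4))
    model.add(Activation('relu'))
    # One output node per industry tag
    model.add(Dense(62))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    # Note: the test split is used as the validation data here
    history = model.fit(X_train, y_label_train,
                        batch_size=64,
                        epochs=50,
                        validation_data=(X_test, y_label_test),
                        callbacks=[es_callback])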
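As a rough check on the model's capacity, the sketch below works out the convolution output length and the number of trainable parameters for the architecture above. It assumes inputs are padded to `maxlen = 200` and that the embedding dimension is 100 (the value passed to the `Embedding` layer); the helper names are hypothetical. The frozen embedding layer is left out because it has no trainable weights.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (input_length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


# Values taken from the model definition; maxlen=200 is an assumed input length.
seq_len = conv1d_output_length(input_length=200, kernel_size=3)
trainable = (
    conv1d_params(in_channels=100, filters=200, kernel_size=3)  # Conv1D
    + dense_params(200, 160)                                    # Dense(hidden_dims)
    + dense_params(160, 62)                                     # Dense(62)
)
print(seq_len, trainable)  # 198 102342
```

Comparing the trainable parameter count with the number of training examples gives a quick sense of how easily the network could memorise the training set.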
Code Link: https://colab.research.google.com/drive/1YqbhPX6e4bJ5FnbfHj7fTQH4tUe6diva?usp=sharing