
Problem: Overfitting in a multiclass text classification problem
In my personal project, the objective is to classify the industry tags of a company based on the company description. The steps I've taken are:

  1. Remove stopwords, punctuation, extra whitespace, etc., and split each description into tokens.
  2. Convert the labels and tokens into integer word vectors.
  3. Map the tokens to a pretrained word-embedding matrix.
  4. Set up a CNN with 62 output nodes (62 distinct industry tags to classify).
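
For concreteness, a minimal sketch of steps 1-3 with Keras (the column names "Business Description" and "Industry Tag" and the `glove` word-vector lookup are placeholders for my actual data):

import re

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # strip punctuation/digits
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

texts = df["Business Description"].apply(clean)

tokenizer = Tokenizer()                        # step 1: split into tokens
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=200)
vocab_size = len(tokenizer.word_index) + 1

codes = df["Industry Tag"].astype("category").cat.codes
y = to_categorical(codes, num_classes=62)      # step 2: one-hot labels

embedding_matrix = np.zeros((vocab_size, 100)) # step 3: pretrained vectors,
for word, i in tokenizer.word_index.items():   # `glove`: word -> 100-d vector
    if word in glove:
        embedding_matrix[i] = glove[word]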

Image/Dataset Link for reference: https://drive.google.com/drive/folders/1yLW2YepoHvSp_koHDDzcAAJBIaYQIen0?usp=sharing


The issue I face is that the model overfits regardless of the alterations I make (training ends early because of the validation-loss callback I set up).

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import (Activation, Conv1D, Dense, Dropout,
                                     Embedding, GlobalMaxPooling1D)
from tensorflow.keras.models import Sequential

max_features = 700    # (unused in this snippet)
maxlen = 200          # padded sequence length
embedding_dims = 100  # must match the width of embedding_matrix
filters = 200
kernel_size = 3
hidden_dims = 160
es_callback = EarlyStopping(monitor='val_loss', patience=5)

model = Sequential()
# Frozen pretrained embedding layer
model.add(Embedding(vocab_size, embedding_dims,
                    weights=[embedding_matrix], trainable=False))
model.add(Dropout(0.4))

# Single 1-D convolution over the token sequence, then max-pool over time
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())

model.add(Dense(hidden_dims))
model.add(Dropout(0.4))
model.add(Activation('relu'))

model.add(Dense(62))             # one output per industry tag
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(X_train, y_label_train,
                    batch_size=64,
                    epochs=50,
                    validation_data=(X_test, y_label_test),
                    callbacks=[es_callback])

Code Link: https://colab.research.google.com/drive/1YqbhPX6e4bJ5FnbfHj7fTQH4tUe6diva?usp=sharing

Srinivas

1 Answer


I find the question quite general, though a useful one, and not only for NLP, so there are many topics to cover in order to tackle the issue properly. I would advise you to focus first on the data pre-processing steps; the code snippet describing the neural network architecture is, in my opinion, the last thing to tune. Based on a first look at your Colab code I would suggest:

  • Better data pre-processing:

    • Apply the same transformations to the test data. For example, you seem to apply X['Text_Clean'] = X['Business Description'].apply(lambda x: remove_punct(x)) only on the training data; I did not see a pipeline (or a direct transformation) doing the same cleaning on the test set. A sketch follows this list.
    • You applied word embeddings to the tokens. I believe they would add more value applied to your "text_final" feature, so that you gain from the semantic representation of the whole text narrative.
    • In general: convert to lower case; remove HTML tags, punctuation and non-alphabetic characters; remove stopwords (and add domain-specific words to the stopword list); expand informal forms in the vocabulary, e.g. "what's" -> "what is"; and apply stemming so that words with roughly the same semantics collapse to one standard form.

  • Imbalanced classification: this occurs when the number of examples in the training dataset for each class label is not balanced. Put simply, when a small number of tags account for most of the labels attached to your texts, the model barely trains on the remaining "minority" tags. A class-weighting sketch follows this list.

  • Feature engineering: you can create extra features and metadata to enhance training and learning. For example, add a column with the sentiment of every instance, and/or apply topic modeling as an extra attribute (similar to the "tokens" column in your dataframe, supporting, not replacing, the main text attribute); see the last sketch below.
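
A minimal sketch of the first point, i.e. applying identical cleaning to both splits (remove_punct is the helper from your notebook; the split itself and the integer labels `labels` are illustrative assumptions):

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X["Business Description"], labels, test_size=0.2,
    random_state=42, stratify=labels)          # stratify preserves tag ratios

X_train_text = X_train_text.apply(remove_punct)
X_test_text = X_test_text.apply(remove_punct)  # same transform on the test set

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text)           # fit the vocabulary on train only
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train_text), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test_text), maxlen=200)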
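
For the imbalance point, a minimal sketch using scikit-learn's "balanced" heuristic to up-weight rare tags in the loss (again assuming integer labels `labels`, and reusing the model and callback from your snippet):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

model.fit(X_train, y_label_train,
          batch_size=64, epochs=50,
          validation_data=(X_test, y_label_test),
          class_weight=dict(zip(classes, weights)),  # rare tags weigh more
          callbacks=[es_callback])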
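
And a minimal sketch of the sentiment feature, assuming NLTK's VADER lexicon is available (nltk.download('vader_lexicon') has been run):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
X["sentiment"] = X["Business Description"].apply(
    lambda t: sia.polarity_scores(t)["compound"])  # one numeric score per row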

Lastly, I would not consider it a bad idea to start with a TfidfVectorizer baseline and observe the accuracy there before proceeding to neural networks; a sketch follows. If the above is not enough, you can also explore more robust transfer learning and pre-trained models while using deep neural networks.
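
A minimal baseline sketch (the logistic-regression choice is mine; any linear classifier would do), taking the raw cleaned text and integer tags from the splits above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"))
baseline.fit(X_train_text, y_train)            # raw text in, tag ids out
print(classification_report(y_test, baseline.predict(X_test_text)))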

dimi_fn