I'm trying to debug a model that uses 1D convolutions to classify text that was labeled by humans as being "appropriate" vs "not appropriate" to post on some website. Looking at false positives (wrongly predicted "appropriate"), I see that the text has mostly neutral/positive-sounding words, but the idea conveyed is bad (e.g., talking about "capping population"). For a case like this, I can at least think of ways to help the model learn that the subject of capping population (in this example) should not be classified as "appropriate" for this particular task.
The problem I'm having is understanding what caused the model to predict "not appropriate" for messages that are in fact appropriate. For example, the following message should be considered "appropriate":
"The blame lies with the individual who commits the crime."
The model thinks that's not appropriate, but according to the labeling criteria of the dataset, that's a valid message.
Question
Given a model with an embedding layer for each word, followed by several 1D convs + dense layer, what are some techniques that can help me understand what is causing the model to classify that message as "not appropriate", and what are some potential ways to help the model learn that it is ok?
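For concreteness, the model is roughly like the sketch below (the hyperparameters here are placeholders, not my actual values):

```python
# Rough sketch of the architecture: embedding -> stacked Conv1D -> dense.
# Vocab size, sequence length, filter counts, and kernel sizes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # placeholder vocabulary size
MAX_LEN = 100        # placeholder padded sequence length

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),            # one embedding vector per word
    layers.Conv1D(64, 5, activation="relu"),      # 1D convolutions over word positions
    layers.MaxPooling1D(2),
    layers.Conv1D(64, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # 1 = "appropriate", 0 = "not appropriate"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```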
Update
It turns out that if I take the example phrase above, replace one word at a time, and see how the model classifies each resulting phrase, it classifies the phrase as "appropriate" whenever I replace the word "lies" with just about any other "positive" or "neutral" word. So it seems the model learned that "lies" is a really, really bad word. The question is: how do I create a feature (or features), or otherwise help the model generalize beyond that?
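For reference, the word-substitution probe was roughly this (a sketch; `model`, `tokenizer`, and `MAX_LEN` stand in for the objects from my own pipeline):

```python
# Swap one word at a time for a replacement word and compare the predicted
# probability of "appropriate" for each variant against the original phrase.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def probe_word_swaps(sentence, replacement, model, tokenizer, max_len):
    """Replace each word in turn with `replacement` and print the model's score."""
    words = sentence.split()
    variants = [sentence] + [
        " ".join(words[:i] + [replacement] + words[i + 1:])
        for i in range(len(words))
    ]
    seqs = pad_sequences(tokenizer.texts_to_sequences(variants), maxlen=max_len)
    scores = model.predict(seqs, verbose=0).ravel()
    for text, score in zip(variants, scores):
        print(f"{score:.3f}  {text}")

# Example call with my trained model and fitted tokenizer:
# probe_word_swaps("The blame lies with the individual who commits the crime.",
#                  "rests", model, tokenizer, MAX_LEN)
```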