Dealing with noisy training labels in text classification using deep learning

Question

I have a dataset that consists of sentences and corresponding multi-labels (i.e. a sentence can belong to multiple labels). Using a combination of Convolutional Neural Networks and Recurrent Neural Nets on top of a language model (Word2Vec), I'm able to achieve good accuracy. However, the model is /too/ good at modelling the output, in the sense that a lot of the labels are arguably wrong, and thus so is the output. This means that the evaluation (even with regularization and dropout) gives a wrong impression, since I have no ground truth. Cleaning up the labels would be prohibitively expensive, so I'm left to explore "denoising" the labels somehow. I've looked at things like "Learning from Massive Noisy Labeled Data for Image Classification"; however, that approach learns some sort of noise covariance matrix on the outputs, which I'm not sure how to do in Keras.

Has anyone dealt with the problem of noisy labels in a multi-label text classification setting before (ideally using Keras or similar), and do you have good ideas on how to learn a robust model with noisy labels?

cgnorthcutt · Accepted Answer · 2022-09-02T01:23:58.393

7

The cleanlab Python package (pip install cleanlab), of which I am an author, was designed to solve this task: https://github.com/cleanlab/cleanlab/. It's a professional package created for finding label errors in datasets and learning with noisy labels. It works with any scikit-learn model out-of-the-box and can be used with PyTorch, FastText, TensorFlow, etc.

(UPDATED Sep 2022) I've added resources for exactly this task: text classification with noisy labels (labels that are sometimes flipped to other classes):

Blog: https://cleanlab.ai/blog/label-errors-text-datasets/
Runnable Colab Notebook: https://docs.cleanlab.ai/stable/tutorials/text.html

Example -- Find label errors in your dataset.

from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
from cleanlab.count import estimate_cv_predicted_probabilities

# OPTION 1 - 1 line of code for sklearn-compatible models
#   (sklearnModel, SEED, data and labels are placeholders for your own objects)
issues = CleanLearning(sklearnModel, seed=SEED).find_label_issues(data, labels)

# OPTION 2 - 2 lines of code to use ANY model
#   just pass in out-of-sample predicted probabilities
pred_probs = estimate_cv_predicted_probabilities(data, labels)
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='self_confidence',
)

Details on how to compute out-of-sample predicted probabilities with any model here.
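
As a rough illustration of what "out-of-sample predicted probabilities" means, here is a minimal sketch using scikit-learn's `cross_val_predict` with `method="predict_proba"`: each row of the result is predicted by a model that never saw that example during training. The toy sentences, the TF-IDF plus logistic regression pipeline, and the choice of 3 folds are assumptions made for the example, not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative only): 6 positive and 6 negative sentences.
texts = [
    "great movie", "loved it", "wonderful acting",
    "really good film", "excellent plot", "great fun",
    "terrible movie", "hated it", "awful acting",
    "really bad film", "boring plot", "poor script",
]
labels = np.array([1] * 6 + [0] * 6)  # integer labels 0..K-1

# Any classifier exposing fit / predict_proba can be used here.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Each example is scored by a model trained on the other folds only,
# so the probabilities are out-of-sample.
pred_probs = cross_val_predict(
    model, texts, labels, cv=3, method="predict_proba"
)

print(pred_probs.shape)  # one row per example, one column per class
```

An array of this shape is the kind of input that OPTION 2 above passes as `pred_probs`. For a deep learning model, the same idea applies as long as the probabilities for each example come from a model that was not trained on it.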

Example -- Learning with Noisy Labels

Train an ML model on noisy labels as if it were trained on perfect labels.

# Code taken from https://github.com/cleanlab/cleanlab
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Learning with noisy labels in 3 lines of code.
cl = CleanLearning(clf=LogisticRegression())  # any sklearn-compatible classifier
cl.fit(X=train_data, labels=labels)
# Estimate the predictions you would have gotten training with error-free labels.
predictions = cl.predict(test_data)
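
To give a feel for what "find suspicious labels, then retrain without them" can look like, here is a deliberately simplified sketch written with scikit-learn and NumPy only. It is not the implementation behind `CleanLearning`; it only illustrates the general idea of using per-class confidence thresholds on out-of-sample probabilities. The synthetic data, the 10% label flips, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class data with deliberately corrupted labels (illustrative).
X, true_labels = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, class_sep=2.0, random_state=0,
)
rng = np.random.default_rng(0)
noisy_labels = true_labels.copy()
flip_idx = rng.choice(len(noisy_labels), size=30, replace=False)
noisy_labels[flip_idx] = (noisy_labels[flip_idx] + 1) % 3  # flip 10% of labels

# Step 1: out-of-sample predicted probabilities via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, noisy_labels,
    cv=5, method="predict_proba",
)

# Step 2: per-class threshold = mean predicted probability of class k
# over the examples whose given label is k.
n_classes = pred_probs.shape[1]
thresholds = np.array(
    [pred_probs[noisy_labels == k, k].mean() for k in range(n_classes)]
)

# Step 3: an example is suspect if the most probable class among those
# that clear their threshold differs from the given label.
above = pred_probs >= thresholds
confident_class = np.where(above, pred_probs, -1.0).argmax(axis=1)
suspect = above.any(axis=1) & (confident_class != noisy_labels)

# Step 4: retrain on the examples that were not flagged.
clean_model = LogisticRegression(max_iter=1000)
clean_model.fit(X[~suspect], noisy_labels[~suspect])
predictions = clean_model.predict(X)

print(f"flagged {suspect.sum()} of {len(suspect)} examples as suspect")
```

The thresholds adapt to how confident the model tends to be for each class, which is why a single global cutoff is not used here.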

Given that you also may be working with image classification and audio classification, here are working examples for Image Classification with PyTorch and Audio Classification with SpeechBrain.

Additional documentation is available here: docs.cleanlab.ai

edited Sep 02 '22 at 01:23

answered Dec 04 '18 at 01:25

cgnorthcutt


I was wondering if it is possible to flip the noisy labels in binary classifications, instead of removing them completely. – Sarah Oct 14 '20 at 04:17
You can, just note that if your model has low accuracy, this will introduce more error, and in a way that is biased by your model. If you do this iteratively, you can fall into a bad local minimum. – cgnorthcutt Oct 15 '20 at 12:36
Thanks! One more question: by noisy labels in CL, do we mean random noises (e.g. someone labelled a cat as a dog just by mistake), or it also considers mislabeled data due to the difficulty of the object as noisy labels (e.g. it's hard to say the image is a cat or a dog and we probably select a wrong label)? – Sarah Oct 15 '20 at 16:01
@Sarah Neither, but much closer to the second than to random noise. CL models class-conditional noise. So that means, for every class, it learns the probability of it being mislabeled as any other class. This assumption is commonly used because it is reasonable. For example, in ImageNet, a "tiger" is more likely to be mislabeled "cheetah" than "flute." – cgnorthcutt Oct 15 '20 at 18:04
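
The class-conditional noise assumption described in the comment above can be sketched with a small simulation: each true class has its own probability of being recorded as each other class. The probabilities in `noise_matrix` below are made-up numbers chosen to mirror the tiger/cheetah/flute example. In practice the true labels are unknown, so such a matrix has to be estimated rather than counted directly as done here.

```python
import random

classes = ["tiger", "cheetah", "flute"]

# noise_matrix[i][j] = P(given label is j | true label is i).
# Hypothetical values: tigers are often confused with cheetahs,
# almost never with flutes.
noise_matrix = [
    [0.79, 0.20, 0.01],  # true tiger
    [0.15, 0.84, 0.01],  # true cheetah
    [0.01, 0.01, 0.98],  # true flute
]

rng = random.Random(0)
true_labels = [i for i in range(3) for _ in range(2000)]

# Sample a (possibly wrong) given label for each example, depending
# only on its true class: this is class-conditional noise.
given_labels = [
    rng.choices(range(3), weights=noise_matrix[t])[0] for t in true_labels
]

# Recover the noise rates by counting (true, given) pairs.
counts = [[0] * 3 for _ in range(3)]
for t, g in zip(true_labels, given_labels):
    counts[t][g] += 1
estimated = [[c / sum(row) for c in row] for row in counts]

for name, row in zip(classes, estimated):
    print(name, [round(p, 3) for p in row])
```

Uniform random noise would correspond to every off-diagonal entry in a row being equal; class-conditional noise allows them to differ per class pair.
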
I am getting `X_train_cv, X_holdout_cv = X[cv_train_idx], X[cv_holdout_idx] TypeError: only integer scalar arrays can be converted to a scalar index`. – hafiz031 Oct 20 '21 at 04:25
@hafiz031 check that your labels are the integers 0, 1, 2… if they aren’t, map your labels to those integers first. Make sure you don’t skip one. For example, labels 0, 1, 3, 4 won’t work. (An update is coming to support these variations at the end of this year, but for now that should fix it) – cgnorthcutt Oct 21 '21 at 13:07
@cgnorthcutt I am using it with my existing `Fasttext` (from `Facebook`) module. In `fit(self, X, y)`, `X` is a list of texts and `y` is a list of labels. As `Fasttext` wants a specifically formatted input like `__label__LABEL SENTENCE`, this is done inside the fit method: the data set is saved to a file and then passed to the fasttext API, everything inside the fit method. The `predict(self, X)` method, on the other hand, takes a list of texts and gives predictions as a list. To avoid skips in the labels as you mentioned, I applied label encoding (from `scikit-learn`) beforehand but am still getting the error. – hafiz031 Oct 23 '21 at 03:05
Hi @cgnorthcutt, I have checked the labels, there is no gap in sequence, There are `40` labels and all the labels are within `[0, 39]` and each of the values from `0` to `39` exists at least once. The data type of each label is ``. They are generated by `LabelEncoder`. But please note that my `X` is not numeric, it is a list of texts, it is converted to numeric form inside `fit()` method. – hafiz031 Oct 24 '21 at 02:59
@hafiz031 hmm this is odd. let's move the discussion to https://github.com/cleanlab/cleanlab/issues - can you post the issue here? Cleanlab supports fasttext and it should work for you. Here's an example with the amazon reviews dataset and instructions for fasttext: https://github.com/cleanlab/cleanlab/tree/master/examples/amazon_reviews_dataset – cgnorthcutt Oct 24 '21 at 14:56
@cgnorthcutt Yeah this should work and I think this is happening for my mistake. I shall dig down the issue further. If I get any concrete reason, then I shall post it. – hafiz031 Oct 25 '21 at 06:39
is 's=train_noisy_labels' just the y_train labels? – Maths12 Nov 01 '21 at 14:43
`s` is the labels of the dataset, same as `y_train`. We use `s` to clarify that the labels can be noisy whereas `y` typically assumes error-free labels. @Maths12 – cgnorthcutt Nov 02 '21 at 18:53
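
The thread above does not confirm what caused the `TypeError` in `X[cv_train_idx]`. One plausible cause, given that `X` was a plain Python list of texts, is that a list cannot be indexed with a NumPy array of indices. The sketch below reproduces that behaviour on made-up data and also shows the label mapping to consecutive integers suggested in the comments. All names and values are illustrative assumptions.

```python
import numpy as np

texts = ["first sentence", "second sentence", "third sentence", "fourth sentence"]
raw_labels = ["spam", "ham", "spam", "other"]

# Map arbitrary labels to consecutive integers 0..K-1 with no gaps.
class_names = sorted(set(raw_labels))
label_to_id = {name: i for i, name in enumerate(class_names)}
labels = np.array([label_to_id[name] for name in raw_labels])

# Cross-validation code typically selects rows with an index array.
fold_idx = np.array([0, 2])

# Indexing a Python list with an index array raises a TypeError.
try:
    texts[fold_idx]
    list_indexing_failed = False
except TypeError:
    list_indexing_failed = True

# Converting the list to a NumPy array makes the same indexing work.
X = np.array(texts)
subset = X[fold_idx]

print(list_indexing_failed, subset.tolist(), labels.tolist())
```

If a wrapped model needs a list of strings inside `fit`, calling `X.tolist()` there converts the array back.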
