
I have a few words without labels that I need to classify into 4-5 categories. Just by looking at them, I can see that this test set can be classified. However, I do not have any training data, so I need to use a pre-trained model to classify these words. Which model is good for this paradigm, and on which dataset has it already been trained?

Thanks

scifi_bot

1 Answer


The task we are talking about is called Zero-Shot Topic Classification - predicting a topic that the model has not been trained on. This paradigm is supported by the Hugging Face transformers library; you can read more about it in their documentation on zero-shot classification. The most commonly used pre-trained model is BART Large MNLI - the bart-large checkpoint fine-tuned on the MNLI dataset. Here is a simple example showing the classification of the phrase "I like hot dogs" without any preliminary training:

  1. First of all, please install the transformers library:

    pip install --upgrade transformers
    
  2. Then import and initialize the pipeline:

    from transformers import pipeline
    
    classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
    
  3. Enter our toy dataset:

     labels = ["artifacts", "animals", "food", "birds"]
     hypothesis_template = 'This text is about {}.'
     sequence = "I like hot dogs"
    
  4. Predict the label:

    # note: newer transformers versions use multi_label instead of the deprecated multi_class argument
    prediction = classifier(sequence, labels, hypothesis_template=hypothesis_template, multi_label=True)
    
    print(prediction)
    

The output will be something like:

    {'sequence': 'i like hot dogs',
     'labels': ['food', 'animals', 'artifacts', 'birds'],
     'scores': [0.9971900582313538, 0.00529429130256176, 0.0020991512574255466, 0.00023589911870658398]}

This can be interpreted as follows: the model assigns the highest probability (about 0.997) to the label 'food', which is the correct answer. The labels are returned sorted by score, and because each label is scored independently with multi_label=True, the scores do not have to sum to 1.
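Since the original question is about assigning a handful of unlabeled words to 4-5 categories, the same pipeline can simply be run on each word. The sketch below reuses the `classifier` and `hypothesis_template` defined above; the word list and category names are made-up placeholders, not taken from the question. Leaving `multi_label` at its default turns the scores into a distribution over the categories, so the top-ranked label is the single best guess for each word.

    # Hypothetical word list and categories, just for illustration
    words = ["sparrow", "hammer", "pizza", "tiger"]
    categories = ["artifacts", "animals", "food", "birds"]

    for word in words:
        result = classifier(word, categories, hypothesis_template=hypothesis_template)
        # labels are returned sorted by score, so index 0 is the best match
        print(word, "->", result["labels"][0], round(result["scores"][0], 3))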

David Alami