
I have a few words without labels that I need to classify into 4-5 categories. Just by looking at them, I can see that this test set can be classified. However, I do not have any training data, so I need to use a pre-trained model to classify these words. Which model is good for this paradigm, and on which dataset has it already been trained?

Thanks

scifi_bot

1 Answer


The task we are talking about is called Zero-Shot Topic Classification - predicting a topic that the model has not been trained on. This paradigm is supported by the Hugging Face transformers library; you can read more about it in their documentation on zero-shot classification. The most commonly used pre-trained model is BART Large MNLI - the bart-large checkpoint fine-tuned on the MNLI dataset. Here is a simple example showing the classification of the phrase "I like hot dogs" without any preliminary training:

  1. First of all, please install the transformers library:

    pip install --upgrade transformers
    
  2. Then import and initialize the pipeline:

    from transformers import pipeline
    
    classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
    
  3. Enter our toy dataset:

     labels = ["artifacts", "animals", "food", "birds"]
     hypothesis_template = 'This text is about {}.'
     sequence = "I like hot dogs"
    
  4. Predict the label:

    # note: newer transformers versions use multi_label instead of the deprecated multi_class argument
    prediction = classifier(sequence, labels, hypothesis_template=hypothesis_template, multi_label=True)
    
    print(prediction)
    

The output will be something like:

    {'sequence': 'i like hot dogs',
     'labels': ['food', 'animals', 'artifacts', 'birds'],
     'scores': [0.9971900582313538, 0.00529429130256176, 0.0020991512574255466, 0.00023589911870658398]}

This can be interpreted as follows: the model assigns the highest probability (about 0.997) to the label 'food', which is the correct answer. The labels are returned sorted by score, and because each label is scored independently with multi_label=True, the scores do not have to sum to 1.
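Since the original question is about assigning a handful of unlabeled words to 4-5 categories, the same pipeline can simply be run on each word. The sketch below reuses the `classifier` and `hypothesis_template` defined above; the word list and category names are made-up placeholders, not taken from the question. Leaving `multi_label` at its default turns the scores into a distribution over the categories, so the top-ranked label is the single best guess for each word.

    # Hypothetical word list and categories, just for illustration
    words = ["sparrow", "hammer", "pizza", "tiger"]
    categories = ["artifacts", "animals", "food", "birds"]

    for word in words:
        result = classifier(word, categories, hypothesis_template=hypothesis_template)
        # labels are returned sorted by score, so index 0 is the best match
        print(word, "->", result["labels"][0], round(result["scores"][0], 3))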

David Alami