How to get general categories for text using NLP like fasttext?

Question

I am working on an application and I would like to infer general categories from the text using natural language processing. I am new to Natural Language Processing (NLP).

The Google Natural Language API does this using a reasonable high-level set of content categories such as "/Arts & Entertainment", "/Hobbies & Leisure", etc:

https://cloud.google.com/natural-language/docs/categories

I am hoping to do this using open source and would like to use some general categories such as the Wikipedia high level classifications:

https://en.wikipedia.org/wiki/Category:Main_topic_classifications

fastText seems like a good option, but I'm struggling to find a corpus to use for training. I do see the Wikipedia word vector files and can get the full Wikipedia download, but I don't see an easy way to get the articles tagged with their categories in a form fastText can train on.

Is there some open source tool that can identify high-level general categories given some text -- or is there a training dataset I could use?

score 2 · Accepted Answer · answered Nov 04 '20 at 21:06

I'd suggest using the "zero-shot classification" pipeline from the HuggingFace Transformers library. It's very easy to use and has decent accuracy, given that you don't need to train anything yourself. Here is an interactive web application to see what it does without coding. Here is a Jupyter notebook which demonstrates how to use it in Python. You can just copy-paste code from the notebook.

This would look something like this:

# Run in a terminal first: pip install transformers==3.4.0
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

sequence = "I like just watching TV during the night"
candidate_labels = ["arts", "entertainment", "politics", "economy", "cooking"]

classifier(sequence, candidate_labels)

# output:
# 'labels': ['entertainment', 'economy', 'politics', 'arts', 'cooking'],
# 'scores': [0.939170241355896, 0.13490302860736847, 0.011731419712305069, 0.0025395064149051905, 0.00018942927999887615]

Here are details on the theory, if you are interested.
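If you only want the categories that clear some confidence level, you can post-process the result yourself. The following is a minimal sketch, assuming the classifier returns a dict with parallel 'labels' and 'scores' lists as in the output above. The helper name top_categories and the 0.5 threshold are my own illustrative choices, not part of the Transformers API.

```python
from typing import Dict, List, Tuple


def top_categories(result: Dict[str, list], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (label, score) pairs whose score is at least `threshold`.

    `result` is assumed to have parallel "labels" and "scores" lists,
    like the zero-shot output shown above. Hypothetical helper.
    """
    pairs = zip(result["labels"], result["scores"])
    return [(label, score) for label, score in pairs if score >= threshold]


# A dict shaped like the output above (scores rounded for readability).
example = {
    "labels": ["entertainment", "economy", "politics", "arts", "cooking"],
    "scores": [0.939, 0.135, 0.012, 0.003, 0.0002],
}

print(top_categories(example))  # [('entertainment', 0.939)]
```

In practice you would pass the return value of classifier(sequence, candidate_labels) instead of the hand-written example dict, and tune the threshold on a few texts of your own.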

Fascinating. I will check this out. Thank you! – Duane, Nov 04 '20 at 23:59

score 1 · Answer 2 · answered Nov 04 '20 at 17:48

I think what you are looking for is a free, pretrained model with general categories that you can use to classify text. That will be difficult to find given the nature of such categories; models like this are usually offered as services, such as the Google Cloud Natural Language API.

At this point I think you have two options:

1. Use a service like the Google Cloud Natural Language API. It provides a model already trained on millions of data points that you can integrate into your application; you just need to consider the pricing.
2. Train your own model. First gather a dataset containing the kind of text you want to classify, along with the category each text belongs to (or label the dataset yourself with the desired categories). Then use a library such as spaCy or NLTK to preprocess the data and train a text classification model.
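For the second option, here is a rough sketch of the data preparation step. It targets fastText (the library named in the question) rather than spaCy or NLTK, and it assumes you already have (category, text) pairs from somewhere. fastText's supervised mode expects one example per line, with each label prefixed by __label__. The helper name to_fasttext_line and the sample data are made up for illustration.

```python
import re
from typing import List, Tuple


def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as '__label__<label> <text>' on a single line.

    Whitespace inside the label becomes underscores so the label stays one
    token; newlines and repeated spaces in the text are collapsed.
    Hypothetical helper, not part of the fastText API.
    """
    clean_label = re.sub(r"\s+", "_", label.strip())
    clean_text = " ".join(text.split())
    return f"__label__{clean_label} {clean_text}"


# Made-up sample data; replace with your own labelled texts.
samples: List[Tuple[str, str]] = [
    ("Arts & Entertainment", "I like just watching TV\nduring the night"),
    ("Cooking", "Simmer the sauce for ten minutes"),
]

lines = [to_fasttext_line(label, text) for label, text in samples]
print("\n".join(lines))

# After writing `lines` to a file such as train.txt, and with the fasttext
# package installed, training would look roughly like this:
#   import fasttext
#   model = fasttext.train_supervised(input="train.txt")
#   model.predict("some new text")
```

A couple of samples is only enough to show the format; a usable classifier needs many labelled examples per category.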
