I am working on an application and would like to infer general categories from text using natural language processing (NLP). I am new to NLP.
The Google Natural Language API does this using a reasonable, high-level set of content categories such as "/Arts & Entertainment", "/Hobbies & Leisure", etc.:
https://cloud.google.com/natural-language/docs/categories
I am hoping to do this with open-source tools and would like to use general categories such as the Wikipedia high-level classifications:
https://en.wikipedia.org/wiki/Category:Main_topic_classifications
fastText seems like a good option, but I'm struggling to find a corpus to use for training. I do see the Wikipedia word vector files, and I can get the full Wikipedia download, but I don't see an easy way to get the articles tagged with their categories in a form fastText can train on.
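To make the target concrete, here is a minimal sketch of the training format as I understand it: fastText's supervised mode reads one example per line, with each label prefixed by `__label__` and followed by the text. The helper name `to_fasttext_line` and the sample data are hypothetical, made up for illustration. The real (categories, text) pairs are exactly what I am missing.

```python
import re

def to_fasttext_line(categories, text):
    """Format one (categories, text) example as a fastText supervised-training line."""
    # Labels must not contain whitespace, so collapse each run of
    # non-alphanumeric characters into a single underscore.
    labels = [
        "__label__" + re.sub(r"[^A-Za-z0-9]+", "_", c).strip("_")
        for c in categories
    ]
    # Each example has to fit on one line, so flatten newlines and extra spaces.
    flat_text = " ".join(text.split())
    return " ".join(labels) + " " + flat_text

# Hypothetical examples, not real data.
examples = [
    (["Arts & Entertainment"], "The gallery opened a new exhibition\nof modern sculpture."),
    (["Hobbies & Leisure", "Sports"], "Weekend hiking trails for beginners."),
]

lines = [to_fasttext_line(cats, text) for cats, text in examples]
for line in lines:
    print(line)
```

Writing lines like these to a file would give something I could pass to fastText's supervised training, assuming I had the tagged articles in the first place.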
Is there an open-source tool that can identify high-level general categories for a given text, or is there a training dataset I could use?