What is the approach to generating topics from text using a Wikipedia dump?

Question

I'm new to NLP and text processing, and I'm building an application that needs to generate topics (Music, Games, Romance, History, etc.) from about two lines of input text.

I've decided to use Wikipedia's article base to help with this.

What would be the steps to "train" my program to recognize and categorize these topics from my input text?

Where does Wikipedia come into the picture? To train anything, you need input which is already categorized according to your criteria, which (by any stretch of imagination) a raw dump of Wikipedia text is not. — tripleee, Apr 10 '15 at 04:45
But this is much too broad to be answered by anything less than an introductory textbook. Nominating to close. — tripleee, Apr 10 '15 at 04:47

score 1 · Answer 1 · answered Apr 10 '15 at 04:35

This is a very broad question. For automated topic modeling (unsupervised, so you don't need labeled training data) you might want to look at Latent Dirichlet allocation (LDA). In Python, gensim is a nice way to do LDA. I've used Weka in Java for classification tasks, which might be closer to what you're looking for. LightSide Researcher's Workbench also offers a GUI for text mining tasks.
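
If classification turns out to be the better fit, here is a rough sketch of the idea in plain Python, using a small multinomial naive Bayes classifier rather than Weka or gensim. Everything in it is an assumption for illustration: the `tokenize`, `train`, and `classify` helpers are hypothetical names, and the six training sentences are made up. In practice you would need a much larger set of texts already labeled with your topics, for example Wikipedia articles mapped to categories you choose.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Count words per topic and documents per topic from (text, topic) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, topic in labeled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the topic with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        # Start from the log prior: how common this topic is in the training set.
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(word_counts[topic].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            count = word_counts[topic][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Made-up labeled examples (assumption); real data would be far larger.
TRAINING = [
    ("The band released a new album with guitar and drum tracks", "Music"),
    ("The orchestra performed a symphony at the concert hall", "Music"),
    ("The player unlocked a new level in the video game", "Games"),
    ("The console multiplayer mode supports four players online", "Games"),
    ("The empire fell after a long war in the fifth century", "History"),
    ("The treaty ended the war between the two ancient kingdoms", "History"),
]

model = train(TRAINING)
print(classify("She played guitar at the concert", model))  # prints "Music"
```

With only two lines of input text there are very few words to go on, so the quality of the labeled training set matters far more than the choice of classifier.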
