Building a thesaurus from corpus

Question

I am working on a natural language processing application. I have a text describing 30 domains. Each domain is defined by a short paragraph that explains it. My aim is to build a thesaurus from this text so that I can determine from an input string which domains are concerned. The text is about 5000 words long and each domain is described in about 150 words. My questions are:

Do I have a long enough text to create a thesaurus from?

Is my idea of building a thesaurus legit, or should I just use NLP libraries to analyse my corpus and the input string?

At the moment, I have counted the total number of occurrences of each word, grouped by domain, because I first thought of an index-based approach. But I am really not sure which method is best. Does anyone have experience in both NLP and thesaurus building?
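
To make the index-based approach concrete, here is a minimal sketch of what I mean, using only the Python standard library. The three domains and their texts are made-up placeholders (my real corpus has 30 domains), and the names `DOMAINS`, `tokenize`, `build_index` and `score` are just for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: domain name -> descriptive paragraph.
DOMAINS = {
    "finance": "Banks manage loans, interest rates and investment accounts.",
    "health": "Hospitals and doctors treat patients and manage health records.",
    "transport": "Buses, trains and roads move passengers across the city.",
}

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(domains):
    """Return a mapping word -> Counter(domain -> number of occurrences)."""
    index = defaultdict(Counter)
    for name, paragraph in domains.items():
        for word in tokenize(paragraph):
            index[word][name] += 1
    return index

def score(index, query):
    """For each domain, sum the counts of every query word found in the index."""
    totals = Counter()
    for word in tokenize(query):
        totals.update(index.get(word, {}))
    return totals.most_common()

index = build_index(DOMAINS)
print(score(index, "Which doctors treat patients?"))
```

With raw counts like these, frequent words such as "and" match every domain, so some stop-word filtering or weighting would probably be needed on real text.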

score 2 · Accepted Answer · answered Jun 13 '14 at 16:49

I think what you are looking for is topic modeling. Given a word, you want the probability that it belongs to each domain. I would recommend using an off-the-shelf implementation of LDA (Latent Dirichlet Allocation). Alternatively, you can visit David Blei's website. He has written some great software that implements LDA, and topic modeling in general. He has also presented several tutorials on topic modeling for beginners.
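
To illustrate the "probability of a domain given the words" idea, here is a rough sketch. Note that this is not LDA: since your domains are already labelled, it uses a much simpler count-based estimate (a naive Bayes style score with add-one smoothing and a uniform prior over domains). The toy corpus and the names `DOMAINS`, `train` and `domain_probabilities` are assumptions made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: domain name -> descriptive paragraph.
DOMAINS = {
    "finance": "Banks manage loans, interest rates and investment accounts.",
    "health": "Hospitals and doctors treat patients and manage health records.",
    "transport": "Buses, trains and roads move passengers across the city.",
}

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())

def train(domains):
    """Count words per domain and collect the shared vocabulary."""
    counts = {name: Counter(tokenize(text)) for name, text in domains.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def domain_probabilities(counts, vocab, query):
    """Estimate P(domain | query), assuming a uniform prior over domains."""
    words = [w for w in tokenize(query) if w in vocab]  # skip unseen words
    log_scores = {}
    for name, word_counts in counts.items():
        # Add-one smoothing so a word missing from a domain never gives zero.
        total = sum(word_counts.values()) + len(vocab)
        log_scores[name] = sum(
            math.log((word_counts[w] + 1) / total) for w in words
        )
    # Normalise in log space to avoid underflow on long queries.
    top = max(log_scores.values())
    weights = {name: math.exp(s - top) for name, s in log_scores.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}

counts, vocab = train(DOMAINS)
probs = domain_probabilities(counts, vocab, "interest rates on loans")
print(max(probs, key=probs.get))
```

If none of the query words appear in the corpus, every domain gets the same probability, which is a reasonable "don't know" signal. An LDA library would instead learn the topics without using your domain labels.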

batgirl

It seems to be what I was looking for. I have also heard about hidden Markov models (HMM), which seem pretty relevant. I'll study the two of them and see how I should use them. Thank you very much. – Kabulan0lak Jun 16 '14 at 08:56

score 1 · Answer 2 · answered Jun 12 '14 at 08:24

If your goal is to build a thesaurus, then build a thesaurus; if your goal is not to build a thesaurus, then you had better use what is already available out there.

More generally, for any task in NLP - from data acquisition to machine translation - you're gonna face numerous problems (both technical and theoretical), and it is very easy to stray from the path, as these problems are - most of the time - fascinating.

Whatever the task is, build a system using existing resources. Then you get the big picture; then you can start thinking about improving component A or B.

Good luck.

Pierre

My goal isn't to build or not to build a thesaurus; it is to understand what my application's users are writing so I can conclude which domains they are talking about. If that involves building a thesaurus, I'll build it. But does it? And what do you mean by "build a system using existing resources"? Which resources are you thinking of? Thank you for your help. – Kabulan0lak Jun 12 '14 at 08:56
There are a lot of resources available out there (lexicons, tokenizers, stemmers, parsers, named entity recognizers, etc.); you should be able to build your system just by combining them. As for a thesaurus, you might want to start with WordNet (which is free). – Pierre Jun 12 '14 at 09:05
Ok thank you. I'll do everything I can with what exists. – Kabulan0lak Jun 12 '14 at 09:24
