I have 4 different categories, and around 3000 words that belong to each of these categories. When a new sentence comes in, I can break it into words and generate more words related to it, so for each new sentence I get 20-30 words. What is the best way to classify this sentence into one of the categories mentioned above? I know bag of words works well. I also looked at LDA, but it works with documents, whereas my training corpus is a list of words. LDA looks at the position of a word in the document, so I could not get meaningful results from it.

rusty
  • Could you explain this, please: "So say for each new sentence I can get 20-30 words generated from the sentence."? How, specifically, are you "generating" words from your sentences? Second, have you tried something like a simple cosine similarity score for your (enriched?) word vectors? – fnl Mar 11 '15 at 13:50
  • I am using Google's word2vec to get words similar to those in the sentence. I have not tried a cosine similarity score yet. Thanks for the suggestion, I'll look into that. – rusty Mar 11 '15 at 16:07
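A minimal sketch of the cosine similarity score suggested in the comments, assuming each category is represented simply as a bag of its words and the new sentence as a bag of its 20-30 generated words. The category names, word lists, and the `classify` helper are hypothetical stand-ins, not the asker's actual data.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical category word lists (stand-ins for the ~3000 words per category).
category_words = {
    "animals": ["dog", "cat", "bite", "stray", "pet"],
    "traffic": ["car", "road", "signal", "jam", "bus"],
}


def classify(words, categories):
    """Return the best-scoring category name and all per-category scores."""
    sentence_vec = Counter(words)
    scores = {
        name: cosine_similarity(sentence_vec, Counter(vocab))
        for name, vocab in categories.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


best, scores = classify(["stray", "dog", "bite", "street"], category_words)
print(best, scores)
```

With only four categories, comparing the sentence against every category vector is cheap, and the per-category scores also show how close the runner-up was.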

2 Answers

I'm not sure I fully understand your question. Bag of words works well for some purposes, but in many cases it throws away potentially useful information (word order, for example). And assuming that you get a grammatical sentence as input, why not treat your sentence as a document and still use LDA? The position of a word in your sentence can still be very meaningful.

There are plenty of classification methods available. Which one is best depends largely on your purpose. If you're new to this area, this may be worth a look: https://www.coursera.org/course/ml

Igor

Like Igor, I am also a bit confused about your problem. Be it a document or a sentence, the terms will be part of the feature set for categorization in some form. You can find the most relevant terms of each category and use this knowledge to classify new sentences better. For example, take the sentence "There is a stray dog near our layout which bites everyone who goes near to it". If you extract the useful keywords from this sentence, removing stopwords, only a few remain (stray, dog, layout, bites, near). You can categorize it into a bucket such as "animals_issue". If you train your system with a larger set of examples, this bag-of-words model can help. Otherwise, you can go for LDA or other topic modelling approaches.
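As a rough illustration of the keyword idea above, here is a minimal sketch that strips stopwords from the example sentence and picks the bucket whose keyword set overlaps most with what remains. The stop-word list, the keyword sets, and the "roads_issue" bucket are hypothetical and hand-picked for this example; a real system would learn the relevant terms per category from training data.

```python
import re

# Hypothetical, hand-picked stop-word list; a real system would use a fuller one.
STOPWORDS = {
    "there", "is", "a", "our", "which", "everyone",
    "who", "goes", "to", "it", "the", "and", "of",
}

# Hypothetical keyword sets per bucket (assumed, not taken from real data).
CATEGORY_KEYWORDS = {
    "animals_issue": {"dog", "stray", "bites", "cattle", "monkey"},
    "roads_issue": {"pothole", "road", "traffic", "signal"},
}


def extract_keywords(sentence):
    """Lowercase, tokenize, drop stopwords, and keep first occurrences in order."""
    keywords = []
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def categorize(sentence):
    """Return the bucket sharing the most keywords with the sentence, or None."""
    keywords = set(extract_keywords(sentence))
    overlap = {
        name: len(keywords & vocab)
        for name, vocab in CATEGORY_KEYWORDS.items()
    }
    best = max(overlap, key=overlap.get)
    return best if overlap[best] > 0 else None


sentence = ("There is a stray dog near our layout which bites "
            "everyone who goes near to it")
print(extract_keywords(sentence))
print(categorize(sentence))
```

Returning `None` when nothing overlaps avoids forcing a sentence into an arbitrary bucket; that is the point where falling back to a trained classifier or a topic model would make sense.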

pnv