
Are there any NLP techniques that help with document classification? In particular, could statistics of n-grams over part-of-speech tags be useful? I can't find much in the literature on this topic.

Has anyone found an NLP technique that improved their document classification results? If you know of any surveys on this topic, that would be awesome.

Note: I saw this question, but my corpus is far too large for the only solution there to be practical.

anthonybell
  • I think focusing on the lexicon would be much more productive, especially if you have a large corpus. Sequences of POS give you syntactic differences, but documents written in the same style should have similar distributions, and you might be picking up a writer's idiom or dialect rather than document topic. Try to look at keyword extraction, term extraction, named entity recognition; stats on those should be interesting. Or you could consider just throwing the documents into [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), that should get you some quick results. – Amadan Sep 24 '15 at 01:05
  • I agree that lexicon is important, but I am looking for other strategies that can enhance lexicon-based approaches. – anthonybell Sep 24 '15 at 01:33
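For anyone who wants to try the POS n-gram statistics from the question anyway, here is a minimal sketch of the feature extraction step. It assumes the documents have already been run through a POS tagger; the tag sequence `doc_tags` and the helper name `pos_ngram_freqs` are hypothetical, made up for illustration.

```python
from collections import Counter


def pos_ngram_freqs(tags, n=2):
    """Return the relative frequency of each POS-tag n-gram in one document."""
    # Slide a window of length n over the tag sequence.
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    # A document shorter than n tags yields no n-grams at all.
    return {g: c / total for g, c in counts.items()} if total else {}


# Hypothetical pre-tagged document (Penn Treebank style tags). In practice
# these would come from whatever POS tagger you run over the text.
doc_tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
features = pos_ngram_freqs(doc_tags, n=2)
```

The resulting dictionaries can be appended to lexical features, which makes it easy to check whether they add anything on top of a lexicon-based baseline.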

1 Answer


Quote:

but my corpus is way too large for the only solution there to be practical.

Topic Modelling!

Document classification is a really hot topic at the moment in our research group and in other NLP groups. Our primary focus is probabilistic topic modelling. Topic models are a family of algorithms that aim to discover the hidden thematic structure in large archives of documents, which can then be used for classification. What is exciting is that there is a lot of room for innovation, invention and general improvement. There is plenty to work on, such as ensembles, hybrids and other statistical techniques.

The Stanford Natural Language Processing Group has a free, open-source tool for prototyping topic models called the Stanford Topic Modeling Toolbox. I suggest you check it out.
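To give a feel for what a topic model computes, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. This is not the Stanford toolbox and it will not scale to a large corpus; the function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are all assumptions made up for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed per-document topic proportions.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Toy corpus, invented for this example.
docs = [
    "goal match team goal player".split(),
    "team player match score".split(),
    "vote election party vote".split(),
    "party election policy vote".split(),
]
theta, topic_word = lda_gibbs(docs, num_topics=2)
```

Each row of `theta` is a document's topic mixture, which can be fed to a downstream classifier as a low-dimensional feature vector.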

A starting point (Maybe?)

ham-sandwich
  • Do you think this will still work well if I only have two document classes? – anthonybell Sep 24 '15 at 16:47
  • It would, yes. But you are probably best off just using a bag-of-words model with some sort of binary classifier, such as an SVM/NB ensemble. I believe this is the state of the art, and it would be preferable if you don't plan to scale up the number of categories. – ham-sandwich Sep 24 '15 at 17:50
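
As a rough illustration of the bag-of-words route for two classes, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It covers only the NB half, not the SVM/NB ensemble mentioned in the comment; the function names `train_nb` and `predict_nb` and the toy data are assumptions made up for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Collect per-class word counts from tokenised documents."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for c in classes for w in word_counts[c]}
    return classes, word_counts, doc_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for the tokens."""
    classes, word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for c in classes:
        total = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / n_docs)  # class prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best


# Toy two-class training set, invented for this example.
train_docs = [
    "cheap pills buy now".split(),
    "buy cheap watches".split(),
    "meeting agenda attached".split(),
    "see agenda for the meeting".split(),
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, "cheap watches now".split())
```

For a large corpus you would want a library implementation with sparse matrices rather than Python dictionaries, but the model is the same.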