5

I am new to NLP and this is the first problem I am trying to solve.

I have some documents that have been manually tagged, like this:

doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
.
.
.
.
docN - categoryX

Here I have a fixed set of categories, and any document can have any number of tags associated with it. I want to train a classifier on this input so that the tagging process can be automated.

Thanks

Gabriel M
user1168811
  • You need to actually ask us a question instead of simply expressing an intent to solve some problem. What did you try? What problems did you face? What exactly do you want us to tell you about? – Aditya Mukherji Jan 25 '12 at 15:47
  • Basic "bag of words" analysis would seem like your first stop. Have you tried naive bayes classification of your documents? Many standard tools like `dbacl` are geared more towards many-to-one classification problems, though. – tripleee Jan 25 '12 at 20:41

3 Answers

4

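What you are trying to do is called multi-way supervised text categorization (or classification). Because each document can carry several tags at once, it is more specifically a multi-label problem. Knowing the right question to ask is half the problem.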

As for how this can be done, here are two references:

John Lehmann
3

Most classifiers work on a bag-of-words model. There are several approaches you can try to get the result you expect.

  1. Try the general-purpose multinomial naive Bayes classifier, vary its input parameters, and compare the results.

  2. Try the other naive Bayes variants (http://scikit-learn.org/0.11/modules/naive_bayes.html)

  3. You can also try a classifier that takes sentence structure into account. Using n-grams, you can try 2-, 3-, 4-, and 5-gram models and check how the results vary. The count vectorizer supports n-grams; see this link for an example - http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
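As a rough illustration of the n-gram point above, the sketch below shows how scikit-learn's `CountVectorizer` grows its vocabulary as `ngram_range` widens. The two sample sentences and the helper name `ngram_vocabulary` are made up for illustration; they are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

def ngram_vocabulary(texts, n_max):
    """Return the vocabulary learned from word n-grams of length 1..n_max."""
    vectorizer = CountVectorizer(ngram_range=(1, n_max))
    vectorizer.fit(texts)
    # vocabulary_ maps each n-gram to its column index in the count matrix.
    return vectorizer.vocabulary_

for n_max in (1, 2, 3):
    vocab = ngram_vocabulary(docs, n_max)
    print(f"ngram_range=(1, {n_max}): {len(vocab)} features")
```

Larger n-gram ranges capture more word order but produce many more (and sparser) features, so it is worth comparing accuracy on held-out documents for each setting rather than assuming longer n-grams help.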

Depending on the features of your dataset, no single classifier is guaranteed to be the best for your scenario; you have to try different approaches and see which one fits best.

The simplest initial approach is to get started with a basic classifier using scikit-learn.

  1. Treat each category as a training class and train the classifier on these classes.

  2. For any input docX, classify it with the trained model.

  3. You will get a probability for each category.
  4. Now apply a threshold, for example on the probability difference between the three highest-scoring categories; the categories that meet the threshold are the result for that input document.
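The steps above can be sketched roughly as follows with scikit-learn. This is a minimal sketch under several assumptions: the training documents, tag names, the helper `predict_tags`, and the 0.5 cutoff are all hypothetical, and it uses a fixed per-category probability cutoff with one binary naive Bayes model per category, which is a simpler rule than comparing the gap between the top three categories.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical manually tagged documents (illustration only).
train_docs = [
    "the team won the football match",
    "the minister spoke about the football stadium budget",
    "parliament voted on the new tax law",
    "the new phone has a faster processor",
    "the government plans to regulate phone makers",
    "the striker scored two goals",
]
train_tags = [
    ["sports"],
    ["sports", "politics"],
    ["politics"],
    ["tech"],
    ["politics", "tech"],
    ["sports"],
]

# Step 1: turn the tag lists into a binary indicator matrix, one column per category.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_tags)

# Bag-of-words counts feeding one binary naive Bayes classifier per category.
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
model.fit(train_docs, Y)

def predict_tags(text, threshold=0.5):
    """Steps 2-4: score one document and keep every category at or above the cutoff."""
    probabilities = model.predict_proba([text])[0]
    return [tag for tag, p in zip(mlb.classes_, probabilities) if p >= threshold]

print(predict_tags("the minister watched the football match"))
```

With real data the cutoff would need tuning on held-out documents, since a lower value returns more tags per document and a higher value returns fewer.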
user123
0

It's not clear what you have tried or which programming language you are using, but as most have suggested, try text classification with document vectors or bag of words (as long as the documents contain words that can help with classification).

Here are some simple tools that can help you get started:

Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
NLTK http://www.nltk.org (Python)
Mallet http://mallet.cs.umass.edu/ (command line & Java)
NUML http://numl.net/ (C#)
ryder1211212
  • To ask for clarification, add a comment (once you have the reputation). Just dumping in some links is not very helpful. First, the OP can just use the search engine of their choice. Second, links can go stale, making your answer pointless. – Robert Feb 01 '17 at 20:36