2

For text that contains company names, I want to train a model that automatically tags contractors (the company executing the task) and principals (the company hiring the contractor).

An example sentence would be:

Blossom Inc. hires the consultants of Big Think to develop an outsourcing strategy.

with Blossom Inc. as the principal and Big Think as the contractor.

My first question: Is it enough to tag only the principals and contractors in my training set or is it better to additionally use POS-tagging?

In other words, either

Blossom/PRINCIPAL Inc./PRINCIPAL hires/NN the/NN consultants/NN of/NN Big/CONTRACTOR Think/CONTRACTOR to/NN develop/NN an/NN outsourcing/NN strategy/NN ./.

or

Blossom/PRINCIPAL Inc./PRINCIPAL hires/VBZ the/DT consultants/NNS of/IN Big/CONTRACTOR Think/CONTRACTOR to/TO develop/VB an/DT outsourcing/NN strategy/NN ./.

Second question: Once I have my training set, which algorithm(s) of the NLTK package is/are most promising? N-Gram Tagger, Brill Tagger, TnT Tagger, Maxent Classifier, Naive Bayes, ...? Or am I completely on the wrong track here?

I am new to NLP and I just wanted to ask for advice before I invest a lot of time in tagging my training set. And my text is in German, which might add some difficulties... Thanks for any advice!

tobip

3 Answers

2

I'd advise you not to merge named-entity and POS information. Most work has shown that POS tags (or otherwise some morphological and/or capitalization features) are valuable for detecting named entities. Since you can quite safely use an automatic POS tagger (unless you are processing noisy text), you may end up with something like:

Blossom/NNP/PRINCIPAL Inc./NNP/PRINCIPAL hires/VBZ/O the/DT/O consultants/NNS/O of/IN/O Big/NNP/CONTRACTOR Think/NNP/CONTRACTOR to/TO/O develop/VB/O an/DT/O outsourcing/NN/O strategy/NN/O ././O

where the POS level would be tagged automatically, while you manually annotate PRINCIPAL and CONTRACTOR. Also note that most people use the BIO format (B = beginning of an entity, I = inside, O = outside) for tagging named entities.
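As a rough illustration of the BIO format mentioned above, here is a minimal sketch that converts the flat labels of the example sentence into BIO labels. The function name `to_bio` is made up for this example, and the conversion assumes that two entities with the same label never sit directly next to each other (flat labels cannot express that boundary).

```python
def to_bio(tagged):
    """Convert (token, pos, label) triples with flat labels into BIO labels.

    Assumption: consecutive tokens with the same non-O label belong to the
    same entity, since flat labels cannot mark a boundary between them.
    """
    bio = []
    prev = "O"
    for token, pos, label in tagged:
        if label == "O":
            bio_label = "O"
        elif label == prev:
            bio_label = "I-" + label   # continues the current entity
        else:
            bio_label = "B-" + label   # starts a new entity
        bio.append((token, pos, bio_label))
        prev = label
    return bio


line = ("Blossom/NNP/PRINCIPAL Inc./NNP/PRINCIPAL hires/VBZ/O the/DT/O "
        "consultants/NNS/O of/IN/O Big/NNP/CONTRACTOR Think/NNP/CONTRACTOR "
        "to/TO/O develop/VB/O an/DT/O outsourcing/NN/O strategy/NN/O ././O")

# Split from the right so that tokens containing "/" or "." stay intact.
tagged = [tuple(item.rsplit("/", 2)) for item in line.split()]

for token, pos, label in to_bio(tagged):
    print(f"{token}\t{pos}\t{label}")
```

The first two rows come out as `B-PRINCIPAL` and `I-PRINCIPAL`, so a sequence model can learn where a company name starts and where it continues.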

Keep in mind that recognizing organizations is usually quite hard, at least harder than recognizing persons and locations. Unless you have a predefined list of organizations, a large lexicon is required for this. Intuitively, I guess you could divide your task into:

  1. Recognize and filter organizations (ORG), for instance by using a NER tagger
  2. Inject additional processing (patterns/syntax/semantics)
  3. Implement a second model that transforms each relevant ORG into PRINCIPAL or CONTRACTOR
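The three steps above could be prototyped along these lines. This is only a sketch under strong assumptions: the ORG spans are taken as given (step 1 would normally come from a NER tagger), the trigger list `TRIGGER_VERBS` is a hypothetical hand-written pattern, and a toy position rule stands in for the trained second model of step 3.

```python
# Hypothetical pattern list for step 2; a real system would need German verbs
# and far more patterns.
TRIGGER_VERBS = {"hires"}


def org_features(tokens, span, trigger_index):
    """Features for one ORG span (start inclusive, end exclusive)."""
    start, end = span
    return {
        "before_trigger": end <= trigger_index,
        "preceded_by_of": start > 0 and tokens[start - 1].lower() == "of",
        "distance_to_trigger": min(abs(start - trigger_index),
                                   abs(end - 1 - trigger_index)),
    }


def assign_roles(tokens, org_spans):
    """Toy stand-in for the second model: ORG before the trigger verb is the
    PRINCIPAL, ORG after it is the CONTRACTOR. Without a trigger the span
    stays a plain ORG."""
    trigger = next((i for i, tok in enumerate(tokens)
                    if tok.lower() in TRIGGER_VERBS), None)
    roles = {}
    for span in org_spans:
        if trigger is None:
            roles[span] = "ORG"
            continue
        feats = org_features(tokens, span, trigger)
        roles[span] = "PRINCIPAL" if feats["before_trigger"] else "CONTRACTOR"
    return roles


tokens = ("Blossom Inc. hires the consultants of Big Think "
          "to develop an outsourcing strategy .").split()
org_spans = [(0, 2), (6, 8)]  # assumed output of step 1

for (start, end), role in assign_roles(tokens, org_spans).items():
    print(" ".join(tokens[start:end]), "->", role)
```

In practice the feature dictionaries would be fed to a trained classifier rather than to a fixed rule; the rule is only there to show where the decision is made.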
eldams
  • `Most work has shown` sounds fluffy, and I know for a fact that it's not true. Not to mention you're implying that POS is a morphological feature, and that capitalization is comparable to POS, neither of which is true. – Slater Victoroff Jan 08 '14 at 22:13
  • Have a look there http://l2r.cs.uiuc.edu/~danr/Papers/RatinovRo09.pdf about NER/POS. Also, I do not imply POS is a morphological feature, but that for NER, if you don't use POS (which gives you NNPs), you'll have to use some morphological features or at least capitalization. – eldams Jan 08 '14 at 22:24
  • That paper shows POS as an example of a feature that might be used. Very different from saying that POS tagging is required. Also standard POS taggers (specifically the one in NLTK) typically only hit ~60-70% accuracy on new text, so building NER on it is kind of like building a castle on sand. Also your second point is like saying: If you don't use this feature you have to use some other feature. Conll (the most prevalent NER corpus) doesn't contain POS, which makes this a pretty moot point. – Slater Victoroff Jan 08 '14 at 22:52
  • Should I really argue? Quote from cited paper: "Most NER systems use additional features, such as POS tags, shallow parsing information and gazetteers". The CoNLL data format does contain POS: http://www.cnts.ua.ac.be/conll2003/ner/ . Nonetheless, let me slightly modify my answer to take into account your comment. – eldams Jan 09 '14 at 11:00
1
  1. You don't need to POS tag manually. The POS tagger will do it for you.
  2. See this question for POS tagging German.
cyborg
0

Named Entity Recognition (Stanford NER) is enough for your problem.

Using POS tagging will not help your problem.

A sufficient amount of training data for generating the NER model would give you good results.

If you use Stanford NER, it uses a CRF (conditional random field) classifier.
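To give an idea of what the training data for such a CRF model might look like, here is a minimal sketch that renders annotated sentences as one token per line with a tab-separated label and a blank line between sentences. This assumes the two-column layout that Stanford NER reads when its `map` property is set to `word=0,answer=1`; the function name `to_tsv` and the sample data are made up for illustration.

```python
def to_tsv(sentences):
    """Render sentences as 'token<TAB>label' lines, one token per line,
    with a blank line between sentences (assumed Stanford NER layout)."""
    blocks = []
    for sentence in sentences:
        blocks.append("\n".join(f"{token}\t{label}" for token, label in sentence))
    return "\n\n".join(blocks) + "\n"


sentences = [
    [("Blossom", "PRINCIPAL"), ("Inc.", "PRINCIPAL"), ("hires", "O"),
     ("the", "O"), ("consultants", "O"), ("of", "O"),
     ("Big", "CONTRACTOR"), ("Think", "CONTRACTOR"), ("to", "O"),
     ("develop", "O"), ("an", "O"), ("outsourcing", "O"),
     ("strategy", "O"), (".", "O")],
]

print(to_tsv(sentences), end="")
```

The resulting text can be written to a file and referenced from the training properties file; the exact property names should be checked against the Stanford NER documentation.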

Rohan Amrute