I am trying to extract named entities from Dutch text. I used nltk-trainer to train a tagger and a chunker on the Dutch part of the conll2002 corpus. However, the chunker's parse method does not detect any named entities. Here is my code:

import nltk

sentence = 'Christiane heeft een lam.'

# Load the tagger and chunker trained with nltk-trainer
tagger = nltk.data.load('taggers/dutch.pickle')
chunker = nltk.data.load('chunkers/dutch.pickle')

str_tags = tagger.tag(nltk.word_tokenize(sentence))
print(str_tags)

str_chunks = chunker.parse(str_tags)
print(str_chunks)

This is the output of the program:

[('Christiane', u'N'), ('heeft', u'V'), ('een', u'Art'), ('lam', u'Adj'), ('.', u'Punc')]
(S Christiane/N heeft/V een/Art lam/Adj ./Punc)

I was expecting Christiane to be detected as a named entity. Any help?

user1491915
  • What happens when "Christiane" appears in the middle of the sentence? – Fred Foo Jul 02 '12 at 13:19
  • @larsmans No entities either. I even tried with a sentence from the training corpus, but no luck. I used the train_chunker.py on the conll2002 corpus (ned.train) – user1491915 Jul 02 '12 at 13:33
  • Can you show exactly how you used train_chunker.py? My demo at http://text-processing.com/demo/tag/ recognizes Christiane, and I also used train_chunker on conll2002, so there must be a difference in the training arguments. – Jacob Jul 03 '12 at 23:53
  • @Jacob I ran `python train_chunker.py conll2002`. I also tried `python train_chunker.py conll2002 --classifier Maxent`, but after 40 minutes or so I got `ValueError: setting an array element with a sequence.` How did you train your classifier? – user1491915 Jul 04 '12 at 09:08

1 Answer

The conll2002 corpus contains both Spanish and Dutch text, so make sure to restrict training to the Dutch files with the fileids parameter, as in `python train_chunker.py conll2002 --fileids ned.train`. Training on Spanish and Dutch together gives poor results.

The default algorithm is a tagger-based chunker, which does not work well on conll2002. Instead, use a classifier-based chunker such as NaiveBayes. The full command might look like this (I've confirmed that the resulting chunker does recognize "Christiane" as a "PER"):

python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle
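Once the chunker finds entities, you will probably want them as plain (label, text) pairs rather than as a tree. Here is a minimal sketch of one way to do that. It assumes you first convert the parse result to (word, pos, iob) triples with `nltk.chunk.tree2conlltags(chunker.parse(str_tags))`. The helper name `entities_from_iob` is my own, not part of NLTK or nltk-trainer, and the triples below are hand-written to illustrate the format, not actual chunker output.

```python
def entities_from_iob(triples):
    """Group (word, pos, iob) triples into (label, text) entity pairs."""
    entities = []
    current_words, current_label = [], None
    for word, _pos, iob in triples:
        if iob == "O":
            prefix, label = "O", None
        else:
            # "B-PER" -> prefix "B", label "PER"
            prefix, _, label = iob.partition("-")
        # An I- tag extends the open entity only when the label matches
        if prefix == "I" and label == current_label and current_words:
            current_words.append(word)
            continue
        # Otherwise close the open entity (if any) and maybe start a new one
        if current_words:
            entities.append((current_label, " ".join(current_words)))
        current_words, current_label = ([word], label) if label else ([], None)
    if current_words:
        entities.append((current_label, " ".join(current_words)))
    return entities


# Hand-written example triples (an assumption, not real chunker output)
triples = [
    ("Christiane", "N", "B-PER"),
    ("heeft", "V", "O"),
    ("een", "Art", "O"),
    ("lam", "Adj", "O"),
    (".", "Punc", "O"),
]
print(entities_from_iob(triples))
```

Working on the IOB triples keeps the helper independent of the chunker class, so it behaves the same whichever classifier you trained with.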

Jacob
  • I've reproduced the problem in the question, and it occurs even if the tagger and chunker are trained only on ned.train. Moreover, the chunker seems unable to identify any NEs even in sentences from the training corpus with the gold POS tags. – Qnan Jul 08 '12 at 15:31