Thanks to this great answer I got a good start training my own NE chunker for Dutch, using NLTK and the Conll2002 corpus: NLTK named entity recognition in dutch. Using those hints I was also able to train an improved tagger (based on IIS classification) that reaches around 95% accuracy, which is enough for my purposes.
However, the F-measure of the named entity recognition is only around 40%. How can I improve this? I tried the built-in algorithms like Maxent, but I only get a memory error. I then tried to get Megam working, but it won't compile on my Windows machine and there is no binary available anymore. I also ran into dead ends trying to incorporate other software such as libSVM, YamCha, CRF++ and Weka: each has its own manual and its own problems, and they keep stacking up, so I'm feeling a bit overwhelmed.
What I need is a practical approach to NER for Dutch. There has been a lot of research, and I found papers quoting F-measures between 70% and 85%, which would be great. Does anyone have a hint as to where I could find an improved implementation, or how I could build one myself (on Windows)? I would prefer NLTK for its flexibility, but if there is a standard solution in a different toolkit I'm game for that too. Even commercial tools would be welcome.
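For context, my understanding is that the higher-scoring systems in those papers mostly feed their classifier (CRF, Maxent, etc.) much richer per-token features than a default NLTK chunker uses. A minimal sketch of such a feature extractor (the function name `word2features` and all feature names are my own illustration, not taken from any specific toolkit):

```python
def word2features(sent, i):
    """Extract classifier features for the token at position i.

    `sent` is a list of (word, pos) pairs. The feature set below is a
    typical starting point for CoNLL-style NER, not a definitive recipe.
    """
    word, pos = sent[i]
    features = {
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),   # capitalisation is a strong NE cue
        'word.isupper': word.isupper(),
        'word.isdigit': word.isdigit(),
        'suffix3': word[-3:],             # crude morphological signal
        'pos': pos,
    }
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        features['prev.lower'] = prev_word.lower()
        features['prev.pos'] = prev_pos
    else:
        features['BOS'] = True            # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos = sent[i + 1]
        features['next.lower'] = next_word.lower()
        features['next.pos'] = next_pos
    else:
        features['EOS'] = True            # end of sentence
    return features

sent = [('Jan', 'N'), ('woont', 'V'), ('in', 'Prep'), ('Amsterdam', 'N')]
print(word2features(sent, 0)['word.istitle'])  # True
print(word2features(sent, 3)['prev.lower'])    # in
```

Feature dicts like these can be plugged into NLTK's classifier-based chunking, or exported to an external tool's feature format, which is roughly what the CRF-based papers seem to do.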
Here is the code I use for the evaluation now:
import nltk
from nltk.corpus import conll2002
tokenizer = nltk.data.load('tokenizers/punkt/dutch.pickle')  # Dutch sentence tokenizer (not used in the evaluation below)
tagger = nltk.data.load('taggers/conll2002_ned_IIS.pickle')  # IIS POS tagger trained on ned.train
chunker = nltk.data.load('chunkers/conll2002_ned_NaiveBayes.pickle')  # NaiveBayes NE chunker
test_sents = conll2002.tagged_sents(fileids="ned.testb")[0:1000]
print "tagger accuracy on test-set: " + str(tagger.evaluate(test_sents))
test_sents = conll2002.chunked_sents(fileids="ned.testb")[0:1000]
print chunker.evaluate(test_sents)
# chunker trained with the following nltk-trainer command line:
# python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename /nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle