I have been working with Stanford's Named Entity Recognition (NER) tagger (http://nlp.stanford.edu/software/CRF-NER.shtml) in Java and Python, and I've stumbled on an inconsistency that I cannot explain.

Here is the sentence I'm using as an example:

"I said hello to Mr. Jones, and then I went on my way."

In the online demo (http://nlp.stanford.edu:8080/ner/process) of the NER tagger, this returns "Jones" as the named entity for the 3-class model and "Mr. Jones" for the 4-class and 7-class models. I want it to return "Mr. Jones", which ne_chunk in Python's NLTK has no problem doing.

However, when I try this on my own machine (either using the Java GUI or through Python), I only ever get "Jones", without the "Mr.". Interestingly, if I remove the period after "Mr" in this sentence:

"I said hello to Mr Jones, and then I went on my way."

Then I do get "Mr Jones" as my named entity. Even more bizarrely, if I remove all punctuation:

"I said hello to Mr Jones and then I went on my way"

I get only "Jones" again. I have no idea why there is this inconsistency, especially because the online demo correctly returns "Mr Jones" (with or without the period) for all three of those sentences with the 7-class model (the one I prefer to use for my project).

Any ideas why this is happening?

Tools/versions: Windows 7; Java JDK 1.8.0_121; Stanford CoreNLP 3.7.0 (2016-10-31); Stanford NER 3.7.0 (2016-10-31); Python 3.5; NLTK 3.2.1

Python code to reproduce results:

import os

import nltk

# Paths to the Stanford jars, the 7-class model, and the Java binary
stanford_dir = '/my_path_to/stanford_files/'
jarfile = stanford_dir + 'stanford_ner.jar'
model_7class = stanford_dir + 'classifiers/English.muc.7class.distsim.crf.ser.gz'
postagger_jar = stanford_dir + 'stanford_postagger.jar'
java_path = '/my_path_to/Java/jdk1.8.0_121/bin/java.exe'
os.environ['JAVAHOME'] = java_path  # NLTK uses JAVAHOME to locate Java

# 7-class NER tagger and the Stanford tokenizer
st = nltk.tag.StanfordNERTagger(model_filename=model_7class, path_to_jar=jarfile, encoding='utf-8')
st_tokenize = nltk.tokenize.StanfordTokenizer(path_to_jar=postagger_jar).tokenize

my_sent = 'I said hello to Mr. Jones, and then I went on my way.'
tokens = st_tokenize(my_sent)
tags = st.tag(tokens)
# Keep every token whose tag is not 'O', i.e. every token tagged as part of an entity
named_entities = [word for word, tag in tags if tag != 'O']
print(named_entities)
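
The list comprehension above collects tagged tokens one at a time, so a result like "Mr. Jones" would show up as two separate items. Below is a minimal sketch of how consecutive tokens with the same tag could be merged into one entity, assuming the list of (word, tag) pairs that st.tag returns. The helper name group_entities and the sample_tags list are hypothetical and hand-written for illustration; they are not actual tagger output.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of consecutive tokens that share the same non-'O' tag."""
    entities = []
    # groupby yields one run per stretch of identical tags
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(word for word, _ in run), tag))
    return entities


# Hypothetical (word, tag) pairs, written by hand for illustration only
sample_tags = [('I', 'O'), ('said', 'O'), ('hello', 'O'), ('to', 'O'),
               ('Mr.', 'PERSON'), ('Jones', 'PERSON'), (',', 'O'),
               ('and', 'O'), ('then', 'O')]
print(group_entities(sample_tags))
```

Note that this merges adjacent entities of the same type into one span, since the tagger output carries no begin/inside markers to separate them.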
user1895076