I have been working with Stanford's Named Entity Recognition (NER) tagger (http://nlp.stanford.edu/software/CRF-NER.shtml) in Java and Python, and I've stumbled on an inconsistency that I cannot solve.
Here is the sentence I'm using as an example:
"I said hello to Mr. Jones, and then I went on my way."
In the online demo (http://nlp.stanford.edu:8080/ner/process) of the NER tagger, this returns "Jones" as the named entity for the 3-class model and "Mr. Jones" for the 4-class and 7-class models. I want it to return "Mr. Jones", something the NE_chunker in Python's NLTK has no problem doing.
But when I try this on my machine (either using the Java GUI or through Python), I only ever get "Jones", without the "Mr.". Interestingly, if I remove the period after "Mr" in this sentence:
"I said hello to Mr Jones, and then I went on my way."
Then I do get "Mr Jones" as my named entity. And even more bizarrely, if I remove all punctuation:
"I said hello to Mr Jones and then I went on my way"
I get only "Jones" again. I have no idea why there is this inconsistency, especially because the online demo correctly returns "Mr Jones" (with or without the period) for all three of those sentences with the 7-class model (the one I prefer to use for my project).
Any ideas why this is happening?
Tools/versions: Windows 7; Java JDK 1.8.0_121; Stanford CoreNLP 3.7.0 (2016-10-31); Stanford NER 3.7.0 (2016-10-31); Python 3.5; NLTK 3.2.1
Python code to reproduce results:
import os
import nltk

# Local paths to the Stanford files and the Java executable
stanford_dir = '/my_path_to/stanford_files/'
jarfile = stanford_dir + 'stanford_ner.jar'
model_7class = stanford_dir + 'classifiers/English.muc.7class.distsim.crf.ser.gz'
postagger_jar = stanford_dir + 'stanford_postagger.jar'
java_path = '/my_path_to/Java/jdk1.8.0_121/bin/java.exe'
os.environ['JAVAHOME'] = java_path

# 7-class NER tagger, plus the Stanford tokenizer loaded from the POS tagger jar
st = nltk.tag.StanfordNERTagger(model_filename=model_7class, path_to_jar=jarfile, encoding='utf-8')
st_tokenize = nltk.tokenize.StanfordTokenizer(path_to_jar=postagger_jar).tokenize

my_sent = 'I said hello to Mr. Jones, and then I went on my way.'
tokens = st_tokenize(my_sent)
tags = st.tag(tokens)

# Keep every token whose tag is not 'O' (i.e. the token is part of some entity)
named_entities = [word for word, tag in tags if tag != 'O']
print(named_entities)
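The list above contains individual tokens, so a two-token entity such as "Mr. Jones" would show up as two separate items. To make the comparison between sentences easier to read, adjacent tokens with the same tag can be merged into one string. This is only a minimal sketch, assuming the tags arrive as (word, tag) pairs like those returned by st.tag; group_entities is a hypothetical helper name (not part of NLTK), and the sample tags are written by hand for illustration, not actual tagger output:

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens that share the same non-'O' tag."""
    entities = []
    # groupby yields one run per stretch of consecutive pairs with equal tags
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(word for word, _ in run), tag))
    return entities


# Hand-written example tags (an assumption, not real Stanford NER output)
sample = [('I', 'O'), ('said', 'O'), ('hello', 'O'), ('to', 'O'),
          ('Mr.', 'PERSON'), ('Jones', 'PERSON'), (',', 'O')]
print(group_entities(sample))
```

With the real tagger, calling group_entities(tags) in place of the list comprehension would show directly whether "Mr." was tagged as part of the same PERSON span as "Jones". Note that two distinct entities of the same type that happen to sit side by side would be merged by this simple approach.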