
Suppose this is my `filecontent`:

When they are over 45 years old!! It would definitely help Michael Jordan.

Below is my code for tagging sentences.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
               'stanford-ner/stanford-ner.jar')
tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
taggedsents = st.tag_sents(tokenized_sents)

I would expect both tokenized_sents and taggedsents to contain the same number of sentences.

But here is what they contain:

for ts in tokenized_sents:
    print "tok   ", ts

for ts in taggedsents:
    print "tagged    ", ts

>> tok    ['When', 'they', 'are', 'over', '45', 'years', 'old', '!', '!']
>> tok    ['It', 'would', 'definitely', 'help', '.']
>> tagged     [(u'When', u'O'), (u'they', u'O'), (u'are', u'O'), (u'over', u'O'), (u'45', u'O'), (u'years', u'O'), (u'old', u'O'), (u'!', u'O')]
>> tagged     [(u'!', u'O')]
>> tagged     [(u'It', u'O'), (u'would', u'O'), (u'definitely', u'O'), (u'help', u'O'), (u'Michael', u'PERSON'), (u'Jordan', u'PERSON'), (u'.', u'O')]

This is due to the double "!" at the end of what should be the first sentence. Do I have to remove the double "!"s before using st.tag_sents()?

How should I resolve this?
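Update: one workaround (my own sketch, not from the Stanford docs) is to merge runs of identical sentence-final punctuation tokens before calling `st.tag_sents()`, so Stanford's internal tokenizer has nothing left to re-split:

```python
import re

def collapse_repeats(tokens):
    """Merge runs of identical punctuation tokens, e.g. '!', '!' -> '!!',
    so the Stanford tagger does not split the sentence at that point."""
    out = []
    for tok in tokens:
        # Only collapse repeated sentence-final punctuation, not words.
        if out and tok == out[-1] and re.fullmatch(r"[!?.]+", tok):
            out[-1] += tok
        else:
            out.append(tok)
    return out

# Example on the first sentence from above:
# collapse_repeats(['When', 'they', 'are', 'over', '45', 'years', 'old', '!', '!'])
# -> ['When', 'they', 'are', 'over', '45', 'years', 'old', '!!']
```

As alvas notes in the comments, `['!', '!']` becoming `['!!']` is exactly the change that makes the sentence counts line up, but this is trial-and-error territory: other noisy token sequences may still trip up the tagger.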

samsamara
  • There are no `named entities` in your data. See https://en.wikipedia.org/wiki/Named-entity_recognition . Try a sentence like 'Michael Jordan went to Apple Inc. to buy an iPad Air for his daughter Layla Jordan' – alvas Nov 17 '15 at 10:58
  • The sentence tokenization is a weird thing so if you change `['!', '!']` to `['!!']`, it should work. You're working with noisy data. Stanford tools are built on clean data, so it might not scale to any domain / genre – alvas Nov 17 '15 at 11:00
  • it's not a prob with having no NEs (have added a ne to the string but still the same). – samsamara Nov 17 '15 at 11:02
  • Yeah, so it's a weird problem with tokenization. – samsamara Nov 17 '15 at 11:03
  • no idea why it can't just use the sentences I passed to 'tag_sents()' without further tokenizing! – samsamara Nov 17 '15 at 11:05
  • follow StanfordNLPHelp's instructions if you're not set on using NLTK; otherwise, it will take some time to answer why the NLTK API doesn't work as you expect, and some more time for NLTK to improve the API so that it keeps the tokenization provided by the user. – alvas Nov 17 '15 at 21:49
  • You could replace the `!` with a null character so that it does not fail. – Rohan Amrute Dec 24 '15 at 05:45
  • @RohanAmrute yes but then there could be other characters that fails as well – samsamara Dec 24 '15 at 07:57
  • I think there is no fool-proof way of doing this. You have to test it by trial and error. – Rohan Amrute Dec 24 '15 at 08:22

1 Answer


If you follow my solution from the other question instead of using nltk, you will get JSON that properly splits this text into two sentences.

Link to previous question: how to speed up NE recognition with stanford NER with python nltk

StanfordNLPHelp