10

I am trying to extract proper nouns, i.e. names of people and organizations, from very small chunks of text such as SMS messages. The basic taggers available in NLTK (see "Finding Proper Nouns using NLTK WordNet") can find the nouns, but the problem arises when a proper noun does not start with a capital letter. In text like the following, a name such as sumit is not recognized as a proper noun:

>>> from nltk import pos_tag
>>> sentence = "i spoke with sumit and rajesh and Samit about the gridlock situation last night @ around 8 pm last nite"
>>> tagged_sent = pos_tag(sentence.split())
>>> print(tagged_sent)
[('i', 'PRP'), ('spoke', 'VBP'), ('with', 'IN'), ('sumit', 'NN'), ('and', 'CC'), ('rajesh', 'JJ'), ('and', 'CC'), ('Samit', 'NNP'), ('about', 'IN'), ('the', 'DT'), ('gridlock', 'NN'), ('situation', 'NN'), ('last', 'JJ'), ('night', 'NN'), ('@', 'IN'), ('around', 'IN'), ('8', 'CD'), ('pm', 'NN'), ('last', 'JJ'), ('nite', 'NN')]
Brij Raj Singh - MSFT

3 Answers

9

There is a better way to extract the names of people and organizations:

from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from nltk.tree import Tree

# `sentence` is the string from the question
tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(sentence)
pos = pos_tag(toks)
chunked_nes = ne_chunk(pos)

# Named entities are the subtrees; plain (word, tag) tuples are skipped
nes = [' '.join(word for word, tag in ne.leaves()) for ne in chunked_nes if isinstance(ne, Tree)]

However, all named entity recognizers make errors. If you really don't want to miss any proper name, you could keep a dictionary of proper names and check whether each token is contained in it.
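A minimal sketch of the dictionary idea, assuming the name list fits in memory and every name is a single token. `load_names` and `find_known_names` are hypothetical helpers written for this example, not part of NLTK, and the three-name list is only a stand-in for a real one (for instance a file with one name per line).

```python
def load_names(lines):
    """Build a case-insensitive lookup set from an iterable of names."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_known_names(sentence, known_names):
    """Return the tokens of `sentence` found in `known_names`, ignoring case."""
    found = []
    for token in sentence.split():
        # Drop surrounding punctuation so "sumit," still matches "sumit"
        word = token.strip(".,;:!?\"'()")
        if word.lower() in known_names:
            found.append(word)
    return found


# Hypothetical name list, used only for illustration
known_names = load_names(["Sumit", "Rajesh", "Samit"])

sentence = "i spoke with sumit and rajesh and Samit about the gridlock situation last night"
names = find_known_names(sentence, known_names)
print(names)  # ['sumit', 'rajesh', 'Samit']
```

Because both sides are lowercased before comparison, "sumit", "Sumit" and "SUMIT" all match. The trade-off is that any ordinary word that also appears in the name list will be reported as a name, and multi-word names would need extra handling.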

mbatchkarov
user278064
  • Thanks @mbatchkarov. If I have a vast dictionary of names (which I do), how do I make one in Python? Please advise. Your answer looks good, I'll try it. – Brij Raj Singh - MSFT Oct 21 '13 at 13:36
2

You might want to have a look at python-nameparser. It also tries to guess the capitalization of names. Sorry for the incomplete answer, but I don't have much experience using python-nameparser.

Best of luck!

Saheel Godhane
  • Well, it's just a name parser like netgender. As long as you have a name you can parse it, but the idea is to extract names, no matter whether they are written as "sumit", "Sumit" or "SUMIT". – Brij Raj Singh - MSFT Oct 22 '13 at 05:35
0

Try this code:

import nltk

def get_entities(qry="who is Mahatma Gandhi"):
    tokens = nltk.tokenize.word_tokenize(qry)
    pos = nltk.pos_tag(tokens)
    # binary=False keeps the entity types (PERSON, ORGANIZATION, ...)
    sentt = nltk.ne_chunk(pos, binary=False)
    print(sentt)
    person = []
    # NLTK 3 replaced the t.node attribute with t.label()
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf)
    print("person=", person)
    return person

You can get the names of persons, organizations and locations with the help of the ne_chunk() function. Hope it helps. Thanks.

Gunjan