Extract proper names from a dataframe

Question

I would like to extract only proper nouns from this dataframe:

          Titles                                                                                         tag_counts
_       
0   [(post, NNS), (italian, JJ), (matt, NN), (Damon, NNP), (..., :), (news, NN...                                   {'NNS': 5, 'JJ': 4, ':': 4, 'NN': 6, 'VBZ': 1, 'IN': 3, 'VBG': 1, 'VBP': 1, 'TO': 1, 'VB': 1, 'V...
1   [(none, DT), (besides, NNS), (of, IN), (the, DT), (apple, NN), (with, IN), (me,...                  {'DT': 3, 'NNS': 2, 'IN': 3, 'NN': 5, 'JJ': 7, ':': 2, 'PRP$': 1, 'VBP': 1, 'CC': 1}
2   [(chocolate,, NN), (Luke, NNP), (Perry,NNP), (flower, JJ), (beer, NN), (...learn, VBP), (More, NNP), (web...    {'NN': 6, 'JJ': 3, 'VBP': 2, 'JJR': 1, 'IN': 5, 'NNS': 3, ':': 2, 'TO': 1, 'VB': 1, 'CC': 1, 'WP...

I have only the following proper names:

(matt, NN)   # this is a proper name, but it is not identified as it is lowercase
(Damon, NNP)
(Luke, NNP)
(Perry, NNP)
(More, NNP) # this is not a proper name.

However, the first one (matt) is not recognised as a proper name, probably because its first letter is lowercase. This makes me wonder whether there are cases where a proper name is tagged as something else entirely, for example as a verb.

I have tried as follows:

import pandas as pd
from nltk import pos_tag

# Split each title on whitespace and POS-tag the resulting tokens
texts = df['Titles'].str.split().map(pos_tag)

def count_tags(title_with_tags):
    tag_count = {}
    for word, tag in title_with_tags:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
    return tag_count

texts.map(count_tags).head()

texts = pd.DataFrame(texts)
texts['tag_counts'] = texts['Titles'].map(count_tags)
texts.head()

To identify the NNP tags, I have tried the following:

prop = [word for word,pos in texts if pos == 'NNP']

but I get this error:

ValueError: too many values to unpack (expected 2)
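
A likely cause, assuming `texts` is the DataFrame built above: iterating over a DataFrame yields its column labels ('Titles', 'tag_counts') rather than its rows, so `word, pos` ends up trying to unpack a string. The comprehension probably needs to run on each row's list of (word, tag) pairs instead. A minimal sketch follows; the helper name `proper_nouns` and the sample rows are hypothetical and only mirror the shape of the 'Titles' column.

```python
def proper_nouns(tagged_title):
    """Return the words tagged NNP in one list of (word, tag) pairs."""
    return [word for word, tag in tagged_title if tag == 'NNP']


# Hypothetical sample rows, shaped like the tagged 'Titles' column above
rows = [
    [('post', 'NNS'), ('italian', 'JJ'), ('matt', 'NN'), ('Damon', 'NNP')],
    [('none', 'DT'), ('besides', 'NNS'), ('of', 'IN')],
    [('chocolate,', 'NN'), ('Luke', 'NNP'), ('Perry', 'NNP'), ('More', 'NNP')],
]

# Apply the helper row by row instead of iterating over the whole frame
nnp_per_row = [proper_nouns(row) for row in rows]
print(nnp_per_row)
```

With the DataFrame from the question, the same helper could be applied per row with `texts['Titles'].map(proper_nouns)`.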

I would also need to remove words that are not actually proper names, for example More, or words that start with a capital letter only because they follow a full stop and might therefore be wrongly tagged as NNP.
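
One possible heuristic for this filtering, sketched under loud assumptions: the tagger output is left as it is, and NNP tokens are dropped when they appear in a hand-made stoplist or directly follow a token that ends in sentence punctuation. The names `NOT_NAMES` and `filter_proper_nouns` are hypothetical, the stoplist has to be maintained by hand, and a genuine name that opens a sentence would be lost as well.

```python
# Hypothetical stoplist of capitalised words that are not names
NOT_NAMES = {'more'}


def filter_proper_nouns(tagged_title, stoplist=NOT_NAMES):
    """Keep NNP words that are not in the stoplist and do not
    directly follow a token ending in '.', '!' or '?'."""
    kept = []
    prev_word = None
    for word, tag in tagged_title:
        # A word counts as sentence-initial if the previous token
        # ends with sentence punctuation (tokens come from str.split,
        # so punctuation stays attached to the preceding word)
        after_stop = prev_word is not None and prev_word.endswith(('.', '!', '?'))
        if tag == 'NNP' and not after_stop and word.lower() not in stoplist:
            kept.append(word)
        prev_word = word
    return kept


# Hypothetical tagged title for illustration
tagged = [('chocolate,', 'NN'), ('Luke', 'NNP'), ('Perry', 'NNP'),
          ('beer.', 'NN'), ('Learn', 'NNP'), ('More', 'NNP')]
filtered = filter_proper_nouns(tagged)
print(filtered)
```

This does not address lowercase names such as matt, which the tagger labels NN before the filter ever sees them.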

what have you tried? There are a few potentially similar questions from a quick google search: https://stackoverflow.com/questions/20290870/improving-the-extraction-of-human-names-with-nltk — David Erickson, Jul 20 '20 at 21:55
I updated the question. I added more information on the possible issues that might arise, e.g. with an improper tag for those words starting with capital letter after a full stop. — still_learning, Jul 20 '20 at 22:20
