I would like to extract only proper nouns from this dataframe:
Titles tag_counts
0 [(post, NNS), (italian, JJ), (matt, NN), (Damon, NNP), (..., :), (news, NN... {'NNS': 5, 'JJ': 4, ':': 4, 'NN': 6, 'VBZ': 1, 'IN': 3, 'VBG': 1, 'VBP': 1, 'TO': 1, 'VB': 1, 'V...
1 [(none, DT), (besides, NNS), (of, IN), (the, DT), (apple, NN), (with, IN), (me,... {'DT': 3, 'NNS': 2, 'IN': 3, 'NN': 5, 'JJ': 7, ':': 2, 'PRP$': 1, 'VBP': 1, 'CC': 1}
2 [(chocolate,, NN), (Luke, NNP), (Perry,NNP), (flower, JJ), (beer, NN), (...learn, VBP), (More, NNP), (web... {'NN': 6, 'JJ': 3, 'VBP': 2, 'JJR': 1, 'IN': 5, 'NNS': 3, ':': 2, 'TO': 1, 'VB': 1, 'CC': 1, 'WP...
The only candidates for proper names are the following:
(matt, NN) # this is a proper name, but it is not identified as it is lowercase
(Damon, NNP)
(Luke, NNP)
(Perry, NNP)
(More, NNP) # this is not a proper name.
However, the first one (matt) is not recognised as a proper noun, which makes me wonder whether there might be cases where a proper name is tagged as a verb, for example. This probably happens because the first letter is lowercase.
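One way to catch lowercase names that the tagger misses would be a case-insensitive lookup against a list of known names. This is only a sketch: the set `KNOWN_NAMES` and the helper `names_by_lookup` are hypothetical, and a real list would have to come from somewhere else.

```python
# Hypothetical list of known names, stored in lowercase
KNOWN_NAMES = {"matt", "damon", "luke", "perry"}

def names_by_lookup(tagged_title):
    """Return words found in KNOWN_NAMES, ignoring case and the POS tag."""
    found = []
    for word, _tag in tagged_title:
        # Strip trailing/leading punctuation such as the comma in "chocolate,"
        cleaned = word.strip(".,:;!?").lower()
        if cleaned in KNOWN_NAMES:
            found.append(word)
    return found

# Illustrative row shaped like one entry of the 'Titles' column
sample = [("post", "NNS"), ("italian", "JJ"), ("matt", "NN"), ("Damon", "NNP")]
lookup_result = names_by_lookup(sample)
print(lookup_result)
```

The lookup ignores the tag entirely, so it finds `matt` even though it was tagged NN, at the cost of depending on how complete the name list is.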
I have tried as follows:
import pandas as pd
from nltk import pos_tag

# Tag each whitespace-separated token of every title
texts = df['Titles'].str.split().map(pos_tag)

def count_tags(title_with_tags):
    # Count how often each POS tag occurs in one tagged title
    tag_count = {}
    for word, tag in title_with_tags:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
    return tag_count

texts.map(count_tags).head()
texts = pd.DataFrame(texts)
texts['tag_counts'] = texts['Titles'].map(count_tags)
texts.head()
To identify the NNP words, I tried the following:
prop = [word for word,pos in texts if pos == 'NNP']
but I get this error:
ValueError: too many values to unpack (expected 2)
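The error most likely arises because `texts` is a DataFrame at this point, so iterating over it yields column names rather than (word, tag) pairs, and a string such as 'Titles' cannot be unpacked into two variables. A minimal sketch of a per-row extraction follows, assuming each row holds a list of (word, tag) tuples as in the output above. The helper name `proper_nouns` and the sample rows are made up for illustration.

```python
def proper_nouns(tagged_title, tags=("NNP", "NNPS")):
    """Return the words in one tagged title whose tag marks a proper noun."""
    return [word for word, tag in tagged_title if tag in tags]

# Illustrative rows shaped like the 'Titles' column (not the real data)
rows = [
    [("post", "NNS"), ("italian", "JJ"), ("matt", "NN"), ("Damon", "NNP")],
    [("none", "DT"), ("besides", "NNS"), ("of", "IN"), ("the", "DT")],
    [("chocolate", "NN"), ("Luke", "NNP"), ("Perry", "NNP"), ("More", "NNP")],
]

# Apply the helper to each row separately instead of to the whole container
result = [proper_nouns(row) for row in rows]
print(result)  # [['Damon'], [], ['Luke', 'Perry', 'More']]
```

On the dataframe itself the same helper would be applied row by row, for example `texts['Titles'].map(proper_nouns)`.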
I would also need to filter out words that are not proper names, for example More, or words that start with a capital letter only because they follow a period/full stop and might therefore be wrongly classified as NNP.
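A rough heuristic for this kind of filtering might look like the sketch below. It assumes a hand-maintained exclusion set and treats any NNP word directly after a token ending in `.`, `!` or `?` as a sentence start. The names `NOT_NAMES`, `SENTENCE_END` and `filter_proper_nouns` are hypothetical, and the sample row is invented.

```python
# Hypothetical exclusion list: capitalised words that are not names
NOT_NAMES = {"more", "the", "read"}
# Punctuation that may end a sentence
SENTENCE_END = (".", "!", "?")

def filter_proper_nouns(tagged_title):
    """Keep NNP words that are not excluded and do not follow a sentence end."""
    kept = []
    for i, (word, tag) in enumerate(tagged_title):
        if tag != "NNP":
            continue
        if word.lower() in NOT_NAMES:
            continue
        # A capital letter right after a full stop may only mark a new sentence
        if i > 0 and tagged_title[i - 1][0].endswith(SENTENCE_END):
            continue
        kept.append(word)
    return kept

sample = [("beer.", "NN"), ("Learn", "NNP"), ("Luke", "NNP"),
          ("Perry", "NNP"), ("More", "NNP")]
filtered = filter_proper_nouns(sample)
print(filtered)  # ['Luke', 'Perry']
```

This would also discard a genuine name that happens to open a sentence, and it does not handle the first word of a title, so it trades some recall for fewer false positives.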