How to remove an entire line if it does not have a pos tag like CD?

Question

I am reading a news article and POS-tagging it with NLTK. I want to remove the lines that do not contain a CD (cardinal number) tag.

import io
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
stop_words = set(stopwords.words('english'))
# Read the whole article into one string; the file is closed automatically
with open("etorg.txt") as file1:
    line = file1.read()
print(line)
# Tag each whitespace-separated token with its part of speech
words = line.split()
tokens = nltk.pos_tag(words)

How do I remove all sentences that do not contain the CD tag?

MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. -- The second sentence should be removed from the text since it does not have any numbers. — nkrishna, Jan 31 '19 at 12:37

Josh Friedlander · Accepted Answer · 2019-01-31T12:59:46.113


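To drop individual tokens, use [word for word in tokens if word[1] != 'CD']. This filters out the (word, tag) pairs tagged CD; it does not remove whole sentences.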

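EDIT: To work at the sentence level, split the text into sentences and tag each one. The code below returns the sentences that contain no CD tag; swap the two return values to keep only the sentences that do contain a number.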

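import nltk

def has_number(sentence):
    # Despite its name, this returns the sentence only when it has NO CD tag
    for i in nltk.pos_tag(sentence.split()):
        if i[1] == 'CD':
            return ''  # the sentence contains a number: drop it
    return sentence  # no number found: keep it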

line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '

''.join([has_number(x) for x in line.split('.')])

> ' However, industry sources do not confirm this data '
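
For the opposite filter, which is what the question asks for (keep only the sentences that contain a number), something like the following might work. It is only a sketch under a few assumptions: the names keep_sentences_with_tag, tagger and digit_tagger are made up for illustration, the regex sentence splitter is a simple stand-in (it will mis-split abbreviations such as "Dr."), and nltk.pos_tag is only used when no tagger is passed in, which requires the NLTK tagger model to be downloaded.

```python
import re

def keep_sentences_with_tag(text, tag="CD", tagger=None):
    """Return only the sentences in which at least one token carries `tag`."""
    if tagger is None:
        # Imported lazily so the function can be tried with a stand-in tagger
        import nltk
        tagger = nltk.pos_tag
    # Split after ., ! or ? when followed by whitespace, so "3.5" stays intact
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences
            if any(t == tag for _, t in tagger(s.split()))]
    return " ".join(kept)

def digit_tagger(tokens):
    # Hypothetical stand-in for nltk.pos_tag: tags any token with a digit as CD.
    # Unlike the real tagger it will not recognise spelled-out numbers.
    return [(w, "CD" if any(c.isdigit() for c in w) else "NN") for w in tokens]
```

Calling keep_sentences_with_tag(line) uses the NLTK tagger; passing tagger=digit_tagger lets you check the sentence-level logic without any NLTK data installed.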

edited Jan 31 '19 at 12:59

answered Jan 31 '19 at 08:49


this will only remove the word in a sentence. but i want to remove the entire sentence. – nkrishna Jan 31 '19 at 08:58
