I am reading a news article and POS-tagging it with NLTK. I want to remove the sentences that do not contain any token tagged CD (cardinal number).

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # built here but not used below

# Read the whole article into one string.
with open("etorg.txt") as file1:
    line = file1.read()
print(line)

# Split on whitespace and POS-tag every token.
words = line.split()
tokens = nltk.pos_tag(words)

How do I remove all sentences that do not contain the CD tag?

kunif
nkrishna
  • can you give an example of your output? – Josh Friedlander Jan 31 '19 at 09:00
  • MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. -- The second sentence should be removed from the text since it does not have any numbers. – nkrishna Jan 31 '19 at 12:37

1 Answer

To drop the CD-tagged tokens themselves from the tagged list, just use [word for word in tokens if word[1] != 'CD']

EDIT: To get the sentences that have no numbers, use this code:

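import nltk

def has_number(sentence):
    # Despite the name, this returns the sentence only when it has NO
    # token tagged CD; a sentence containing a number becomes ''.
    for i in nltk.pos_tag(sentence.split()):
        if i[1] == 'CD':
            return ''
    return sentence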

line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '

''.join([has_number(x) for x in line.split('.')])

> ' However, industry sources do not confirm this data '
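
The snippet above returns the sentences without numbers. The question asks for the opposite (drop the sentences that have no CD tag), so here is a rough sketch of that inverse. The helper name keep_sentences_with_tag and its tagger parameter are my own, not part of NLTK; the regex split is a naive stand-in for a proper sentence tokenizer, and the result depends on how the tagger labels each token.

```python
import re


def keep_sentences_with_tag(text, tagger=None, tag="CD"):
    """Return only the sentences of `text` that contain a token tagged `tag`.

    `tagger` is a hypothetical hook: any callable that maps a list of
    tokens to (token, tag) pairs. It defaults to nltk.pos_tag.
    """
    if tagger is None:
        # Imported lazily so the helper can be tried with a stand-in tagger.
        import nltk
        tagger = nltk.pos_tag

    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    kept = [
        s for s in sentences
        if s and any(t == tag for _, t in tagger(s.split()))
    ]
    return " ".join(kept)
```

Called as keep_sentences_with_tag(line), it should keep the first and third sentences of the example text, provided the tagger marks 21 and 18 as CD. Abbreviations such as "Mr." will confuse the regex split, so a real sentence tokenizer is likely a better choice for actual news text.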
Josh Friedlander