I'm working on a text-mining use case in Python. These are the sentences of interest:

As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased. Stores are primarily located in shopping malls and other shopping centers.

How can I extract the sentence containing the keyword "China"? I also need the surrounding context: at least two sentences before it and two after it.

I've tried the following, based on an existing answer:

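import nltk
from nltk.tokenize import word_tokenize

# `text` holds the passage quoted above.
sents = nltk.sent_tokenize(text)
# Keeps only the sentences that contain the token 'China', without any context.
my_sentences = [sent for sent in sents if 'China' in word_tokenize(sent)]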

Please help!

PratikSharma
  • It's not entirely clear what `look for China and take the entire sentence` means. – Robo Mop May 15 '19 at 13:38
  • Do you want the sentence including China first? Or do you want the sentence including China from `US` till the end? – Robo Mop May 15 '19 at 13:39
  • I want the entire sentence i.e. (i) search for China (ii) ignore wrong sentence boundary such as the one here – PratikSharma May 15 '19 at 13:40
  • Also, the quoted text - that's all one sentence, right? – Robo Mop May 15 '19 at 13:41
  • yes the quoted text. and also if it's possible extract three previous and three next sentences as well – PratikSharma May 15 '19 at 13:43
  • You could just find `China` then use a loop to find the full sentence by looking for `.`, and just keep track of opening and closing quotes so you know to ignore quoted blocks in your period counter. (you wouldn't even need regex). – Error - Syntactical Remorse May 15 '19 at 13:52
  • You cannot distinguish a sentence-ending period from an abbreviation period, so a regular expression will not work. – Norbert Incze May 15 '19 at 13:56
  • @NorbertIncze Why not? There is only one way to abbreviate, so we can check for that. – Robo Mop May 15 '19 at 13:57
  • Example: "First sentence. Abbreviations in the second sentence ABBR1., abbr2.; China. Third sentence." Here the correct answer would be "Abbreviations in the second sentence ABBR1., abbr2.; China." and I do not see how you could get it with regex, or anything else except AI. – Norbert Incze May 15 '19 at 14:03

1 Answer

TL;DR

Use sent_tokenize, keep track of the index of each sentence that contains the focus word, and slice a window of sentences around that index to get the desired result.

from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

word_detokenize = TreebankWordDetokenizer().detokenize

text = """As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased global economic and political uncertainty and caused volatility in foreign currency exchange rates. Stores are primarily located in shopping malls and other shopping centers, certain of which have been experiencing declines in customer traffic."""

tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text) 
                       if 'China' in sent or 'china' in sent]

window = 2 # If you want 2 sentences before and after.

for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    end = min(idx + window + 1, len(tokenized_text))  # slice end is exclusive
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
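
The same windowing can be wrapped in a small function that returns a list instead of printing, which is handy when the text comes from a larger document. This is only a sketch: `context_windows` is a hypothetical helper name, the sample sentences are made up for illustration, and it assumes the sentences were already split (for example with sent_tokenize). It matches the keyword as a whole word, ignoring case, rather than tokenizing each sentence.

```python
import re

def context_windows(sentences, keyword, window=2):
    """Return one joined string per sentence containing `keyword`,
    padded with up to `window` sentences on each side."""
    # Whole-word, case-insensitive match, so 'china' and 'China' both hit.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    results = []
    for idx, sent in enumerate(sentences):
        if pattern.search(sent):
            start = max(idx - window, 0)
            # Slice ends are exclusive, hence + 1 to keep `window` sentences after.
            end = min(idx + window + 1, len(sentences))
            results.append(" ".join(sentences[start:end]))
    return results

# Made-up sentences standing in for the output of sent_tokenize(text).
sentences = [
    "Sales fell in Europe.",
    "Costs rose in the second quarter.",
    "Tariffs on goods from China have increased.",
    "Stores are located in shopping malls.",
    "Traffic in malls is declining.",
    "Online sales grew.",
]
windows = context_windows(sentences, "china", window=1)
print(windows)
```

With real text, `context_windows(sent_tokenize(text), "China", window=2)` should give one entry per matching sentence.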

Another example (run pip install wikipedia first):

from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

word_detokenize = TreebankWordDetokenizer().detokenize

import wikipedia

text =  wikipedia.page("Winnie The Pooh").content

tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text) 
                       if 'China' in sent or 'china' in sent]

window = 2 # If you want 2 sentences before and after.

for idx in sent_idx_with_china:
    start = max(idx - window, 0)
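    end = min(idx + window + 1, len(tokenized_text))  # slice end is exclusive
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
    print()

[out] (from a run where the slice ended at idx + window, so only one sentence follows the match):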

Ashdown Forest in England where the Pooh stories are set is a popular tourist attraction, and includes the wooden Pooh Bridge where Pooh and Piglet invented Poohsticks. The Oxford University Winnie the Pooh Society was founded by undergraduates in 1982. == Censorship in China == In the People's Republic of China, images of Pooh were censored in mid-2017 from social media websites, when internet memes comparing Chinese president Xi Jinping to Pooh became popular. The 2018 film Christopher Robin was also denied a Chinese release.
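
When several neighbouring sentences mention the keyword, as is likely in a section devoted to it, a loop that prints one window per match will repeat the shared sentences. One possible way around that is to merge overlapping windows before joining the sentences. The sketch below is an assumption-laden illustration, not part of NLTK: `merged_windows` is a hypothetical helper, and the index values in the usage line are invented.

```python
def merged_windows(match_indices, window, n_sentences):
    """Turn match positions into non-overlapping (start, end) slice bounds."""
    spans = []
    for idx in sorted(match_indices):
        start = max(idx - window, 0)
        end = min(idx + window + 1, n_sentences)  # exclusive end
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous span: extend it instead.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans

# Invented example: matches at sentences 3, 4 and 10 in a 12-sentence text.
print(merged_windows([3, 4, 10], window=2, n_sentences=12))
# → [(1, 7), (8, 12)]
```

Each `(start, end)` pair can then be used directly as `tokenized_text[start:end]`.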

alvas
  • Thanks @alvas. Since I had extracted this text from a PDF I changed result to a list to get all sentences with the given window from the PDF. Works fine. Thanks! Upvote! – PratikSharma May 15 '19 at 14:18