I'm searching for target text within a large string. My code finds the text within the string and displays the 40 characters before it and the 40 characters after it. Instead, I want to display the 2 sentences before and the 2 sentences after the target text. My code:

import re

sentence = "In addition, participation in life situations can be somewhat impaired because of communicative disabilities associated with the disorder and parents’ lack of resources for overcoming this aspect of the disability (i.e. communication devices). The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder. The program will use the ICF framework to identify those factors determined to be most beneficial and cost effective in optimising health, function and quality of life for the affected child and her family."

sub = "biopsychosocial model"

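def find_all_substrings(string, sub):
    # Lowercase both sides so the search is case-insensitive
    starts = [match.start() for match in re.finditer(re.escape(sub.lower()), string.lower())]
    return starts

substrings = find_all_substrings(sentence, sub)
for pos in substrings:
    # Clamp the start so a match near the beginning doesn't produce a negative index
    print(sentence[max(pos - 40, 0):pos + 40])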

How do I display the 2 sentences before and the 2 sentences after the target text?

Legion

1 Answer

You can first split the text into sentences, then find all sentences (with their indices) that contain the substring you are looking for. Then just slice the sentences around any found sentences.

Here's an example using nltk.tokenize.sent_tokenize (it needs the Punkt tokenizer data, which you can fetch with nltk.download):

from nltk.tokenize import sent_tokenize

text = "In addition, participation in life situations can be somewhat impaired because of communicative disabilities associated with the disorder and parents’ lack of resources for overcoming this aspect of the disability (i.e. communication devices). The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder. The program will use the ICF framework to identify those factors determined to be most beneficial and cost effective in optimising health, function and quality of life for the affected child and her family."
sentences = sent_tokenize(text)

sub = "biopsychosocial model"
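# Compare in lowercase so the match is case-insensitive
matching_indices = [i for i, sentence in enumerate(sentences) if sub.lower() in sentence.lower()]

n_sent_padding = 1
displayed_sentences = [
    # max(..., 0) keeps the slice start from going negative when a match is near the beginning
    ' '.join(sentences[max(i-n_sent_padding, 0):i+n_sent_padding+1])
    for i in matching_indices
]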

This finds the index of each sentence that contains the substring (stored in matching_indices). displayed_sentences then holds, for each match, the matching sentence joined with the sentences before and after it (how many is set by n_sent_padding). Set n_sent_padding = 2 to get two sentences on each side, as asked in the question.

With n_sent_padding = 1, displayed_sentences is:

['The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder.']
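If you need this in more than one place, the steps above can be wrapped in a small helper. This is only a sketch: the function name sentences_around and the toy input are made up for illustration, and it takes an already-split list of sentences (for example, the output of sent_tokenize) so it runs without nltk.

```python
def sentences_around(sentences, sub, padding=2):
    """Return one joined snippet per sentence containing `sub`,
    with up to `padding` sentences on each side of the match."""
    needle = sub.lower()
    snippets = []
    for i, sentence in enumerate(sentences):
        if needle in sentence.lower():
            # Clamp the start so a match near the beginning never
            # produces a negative slice index.
            start = max(i - padding, 0)
            snippets.append(" ".join(sentences[start:i + padding + 1]))
    return snippets


# Hand-split toy input standing in for the output of sent_tokenize
toy = ["One.", "Two.", "Three has the Target.", "Four.", "Five.", "Six."]
print(sentences_around(toy, "target", padding=2))
# ['One. Two. Three has the Target. Four. Five.']
```

The end of the slice does not need clamping, because Python slices simply stop at the end of the list.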

Pay attention to how nltk splits sentences: sometimes it splits in unexpected places (e.g. on the period in 'Mr.'). This post is about how to tweak the sentence tokenizer.
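To see why abbreviations are a problem for any sentence splitter, here is a rough standard-library sketch. It is not how nltk works internally: it is a naive regex splitter, and the ABBREVIATIONS set and the split_sentences name are assumptions made up for this example. It splits after '.', '!' or '?' followed by whitespace, then re-joins pieces that end in a known abbreviation.

```python
import re

# Hypothetical abbreviation list; extend it for your own text.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "i.e.", "e.g."}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace,
    then glue back pieces whose last word is a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    carry = ""
    for piece in pieces:
        if not piece:
            continue  # empty input yields a single empty piece
        current = f"{carry} {piece}" if carry else piece
        # Drop leading brackets/quotes so "(i.e." is recognised as "i.e."
        last_word = current.split()[-1].lower().lstrip("([\"'")
        if last_word in ABBREVIATIONS:
            carry = current  # not a real sentence end, keep accumulating
        else:
            sentences.append(current)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


print(split_sentences("Mr. Smith arrived. He sat down."))
# ['Mr. Smith arrived.', 'He sat down.']
```

Without the abbreviation check, the same regex would break after 'Mr.' and give three pieces instead of two. For real text a trained tokenizer is usually the better choice; this only illustrates the failure mode.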

Henry Woody