Selecting the sentences surrounding particular words

Question

Suppose I have a paragraph:

Str_wrds ="Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs."

And I have the following Test_wrds:

Test_wrds = ['Power curve', 'data-driven','wind turbines']

Whenever one of the Test_wrds is found in the paragraph, I would like to select the sentence containing it together with one sentence before and one sentence after, and store the result as a separate string. For example, Power curve first appears in the 1st sentence, but the 2nd sentence also contains power curve, so the output would be something like this:

Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.

Likewise, I would like to slice out the sentences for data-driven and wind turbines and save them in separate strings.

How can I implement this using Python in a simple way?

So far I have found code that removes the entire sentence whenever any of the Test_wrds appears in it.

def remove_sentence(Str_wrds, Test_wrds):
    # Keep only the sentences that contain none of the test words
    return ".".join(sentence for sentence in Str_wrds.split(".")
                    if not any(word in sentence for word in Test_wrds))

But I don't understand how to use this for my problem.

Update on the problem: whenever one of the Test_wrds is present in the paragraph, I would like to slice out that sentence together with one sentence before and one sentence after, and save the result in a single string. So for three Test_wrds I expect to get three strings, each covering the sentences around one of the words. I have attached a PDF showing an example of the output I am looking for.

Hi, I can't understand what you mean by this part. Could you rephrase it ? Thanks "I would like to select before and after 1 sentence whenever Test_wrds found it in a paragraph and list them as a separate string. For example, Test_wrds Power curve appeared first in 1st sentence hence but when we select 2nd sentence there are another Power curve words thus the output would be something like" — Cukic0d, Jan 26 '21 at 11:35
"Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty." Your output should have been this — Ajay, Jan 26 '21 at 12:39

score 0 · Answer 1 · answered Jan 26 '21 at 11:45

You could define a function something like this one

def find_sentences( word, text ):
    sentences = text.split('.')
    findings = []
    for i in range(len(sentences)):
        if word.lower() in sentences[i].lower():
            if i==0:
                findings.append( sentences[i+1]+'.' )
            elif i==len(sentences)-1:
                findings.append( sentences[i-1]+'.' )
            else:
                findings.append( sentences[i-1]+'.' + sentences[i+1]+'.' )
    return findings

This can then be called as

findings = find_sentences( 'Power curve', Str_wrds )

With some pretty printing

for finding in findings:
    print( finding +'\n')

We get the results

However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.

Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. Data-driven model accuracy is significantly affected by uncertainty.

The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.

The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines..

which I hope is what you were looking for :)

I am looking for sentences which includes text words also. So something like this: 1 sentence Before > main sentence which contains words> 1 sentence after — Ravi, Jan 26 '21 at 13:03
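
If the matching sentence should be kept as well, as the comment above asks, one option is to slice a window around each hit instead of appending the neighbours separately. This is only a sketch: the name find_context and the window parameter are made up for illustration, and splitting on '.' will still break on abbreviations or decimal numbers.

```python
def find_context(word, text, window=1):
    """Return one string per matching sentence: the match plus `window`
    sentences on either side (hypothetical helper, not from the answer)."""
    # Naive split on '.'; strip whitespace and drop the empty tail piece.
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    findings = []
    for i, sentence in enumerate(sentences):
        if word.lower() in sentence.lower():
            # max() keeps the start index from going negative;
            # slicing past the end of a list is safe in Python.
            chunk = sentences[max(i - window, 0):i + window + 1]
            findings.append('. '.join(chunk) + '.')
    return findings
```

Calling find_context('Power curve', Str_wrds) would then give one entry per sentence that mentions the phrase, each with its neighbours included.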

sushant_padha · Answer 2 · 2021-01-27T09:16:16.303

When you say,

I would like to select before and after 1 sentence whenever Test_wrds found it in a paragraph and list them as a separate string.

I guess you mean that for every sentence containing one of the words in Test_wrds, that sentence, the one before it, and the one after it should all be selected.

Function

def remove_sentence(Str_wrds: str, Test_wrds):
    # store all selected sentences
    all_selected_sentences = {}
    # initialize empty dictionary
    for k in Test_wrds:
        # one element for each occurrence
        all_selected_sentences[k] = [''] * Str_wrds.lower().count(k.lower())

    # list of sentences
    sentences = Str_wrds.split(".")

    word_counter = dict.fromkeys(Test_wrds, 0)

    for i, sentence in enumerate(sentences):
        for j, word in enumerate(Test_wrds):
            # case insensitive
            if word.lower() in sentence.lower():
                if i == 0:  # first sentence
                    chosen_sentences = sentences[0:2]
                elif i == len(sentences) - 1:  # last sentence
                    chosen_sentences = sentences[-2:]
                else:
                    chosen_sentences = sentences[i - 1:i + 2]

                # get which occurrence of the word is it
                k = word_counter[word]

                all_selected_sentences[word][k] += '.'.join(
                    [s for s in chosen_sentences
                        if s not in all_selected_sentences[word][k]]) + "."

                word_counter[word] += 1  # increment the word counter

    return all_selected_sentences

Running this

answer = remove_sentence(Str_wrds, Test_wrds)
print(answer)

with the provided values for Str_wrds and Test_wrds, returns this output

{
    'Power curve': [
        'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.',
        'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty.',
        ' The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.',
        ' The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
    ],
    'data-driven': [
        ' However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.',
        ' Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression.',
        ' Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated.'
    ],
    'wind turbines': [
        ' A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
    ]
}

Notes:

The function returns a dict of lists.
Every key is a word in Test_wrds, and each list element corresponds to one occurrence of that word.
For example, because 'Power curve' occurs 4 times in the text (ignoring case), the value for 'Power curve' in the output is a list of 4 elements.
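
For a shorter variant of the same dict-of-lists idea, a regular-expression split plus a comprehension may be enough. The sketch below is an assumption-laden illustration, not a drop-in replacement: context_windows and window are hypothetical names, and it produces one window per matching sentence rather than per occurrence, so a sentence that mentions a word twice yields a single entry.

```python
import re

def context_windows(text, words, window=1):
    """Map each word to a list of windows, one per sentence containing it
    (hypothetical helper; names are illustrative)."""
    # Split after '.', '!' or '?' followed by whitespace, so each sentence
    # keeps its own punctuation and there is no empty trailing piece.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return {
        word: [
            ' '.join(sentences[max(i - window, 0):i + window + 1])
            for i, s in enumerate(sentences)
            if word.lower() in s.lower()
        ]
        for word in words
    }
```

It would be called as context_windows(Str_wrds, Test_wrds); each value is a list, so the individual strings can be pulled out by index.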

not only before and after but also includes sentence which includes that text_wrds. So something like this: 1 sentence Before > main sentence which contains words> 1 sentence after — Ravi, Jan 26 '21 at 12:52
Can it be simplified with less code? Maybe using a list comprehension or other libraries — Ravi, Jan 26 '21 at 12:53
I ran your code in spyder and I m getting three lists 'answer' and basically repeating sentences. — Ravi, Jan 26 '21 at 15:39
I updated my problem for more clarification, plz do have a look at it. In short, I want three strings for every three text_wrds. So 1st string will take before, after as well as the main sentence for 'power curves'. I do not want you to use all these texts_wrds at the same time, u have to use one at a time. Hope this make sense — Ravi, Jan 26 '21 at 17:38
Your updated code is close to what I am looking for. But they are showing in a single string. what I am looking so for each text_wrds, they show me string. I mean for 'power_curve' = answer 1, 'wind turbine'= answer 2 and so on — Ravi, Jan 26 '21 at 17:42
@ sushant_padha Thanks. but your output is wrong. Look at the 'Power curve' output. U have not selected previous and after sentence but instead selected all. To make sense, I attached pdf which explains what I am looking at.Please see the modified question — Ravi, Jan 27 '21 at 08:15
@Ravi, my answer adds all the outputs (in the pdf) as one string. I can make a function that gives **4** outputs, because 'Power curve' occurs 4 times. Or I can make a function that gives **1** output, i.e., adding all the outputs together. It is difficult and complicated to exactly achieve what you want. So please use one of the above listed *partial solutions*. — sushant_padha, Jan 27 '21 at 08:44
@ sushant_padha thank you very much for this. Actually, I am looking for a function that gives me output every time whenever text words occur. So u are right since 'power curve' occurs 4 times, so hoping to get 4 output sentences within 'Power curve' word and similarly to other 'text_wrds'. No I am not looking for a single output. — Ravi, Jan 27 '21 at 09:06
Thanks. But why in 'Power curve' two sentences are the same. See the image doc I added. In the end, you supposed to have three output within 'Power curve'. Because when the word occur you have to select one sentence before and after the sentence. — Ravi, Jan 27 '21 at 09:24
@Ravi, that's what I said. 'power curves' occurs 4 times, so the answer has 4 sentences for power curves. You can either have all (4) sentences or one sentence. I will still try to fix this, but in the general case, this may be a problem — sushant_padha, Jan 27 '21 at 10:21
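
One way to avoid the repeated sentences discussed in these comments is to merge windows that share a sentence, so that neighbouring hits collapse into a single snippet. The following is a rough sketch under that assumption; merged_context and window are hypothetical names, and the regex split assumes sentences end in '.', '!' or '?' followed by whitespace.

```python
import re

def merged_context(text, word, window=1):
    """Return snippets around `word`, merging windows that overlap so no
    sentence appears twice (hypothetical helper)."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    spans = []  # [start, end) index pairs, in document order
    for i, s in enumerate(sentences):
        if word.lower() in s.lower():
            start = max(i - window, 0)
            end = min(i + window + 1, len(sentences))
            if spans and start < spans[-1][1]:
                # Shares at least one sentence with the previous window:
                # extend that window instead of starting a new one.
                spans[-1][1] = end
            else:
                spans.append([start, end])
    return [' '.join(sentences[a:b]) for a, b in spans]
```

With the paragraph from the question, merged_context(Str_wrds, 'Power curve') should give three snippets rather than four, because the windows around the first two sentences overlap and are merged.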
