
I am trying to search a large text in R for keywords. Once I find one, I want to extract the 1 sentence before and after that keyword (including the sentence with the keyword in it). Ideally, I would like to be able to change this code to extract up to 3 sentences around the keyword. Sample data below.

text <- "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about."

keywords <- c("water quality", "health")

So with the text above, I want to search the text for "water quality" and "health" and when there is a match, I want to extract from "Then in the middle there is..." to "Jon Doe is a key player in this realm."

Finally, I want to repeat this over a number of rows with each row having its own text.

I've looked into using stringr/regex but it's not giving me what I want- I can't pull the full sentences. Any ideas?

Code I've tried:

str_extract_all(text,paste0("([^\\s]+\\s){5}",keywords,"(\\s[^\\s]+){5}"))

-> that gets me a few words on either side

gsub(".*?([^\\.]*('water quality'|health)[^\\.]*).*","\\1", text, ignore.case = TRUE)

-> close also

Ronak Shah
Klasic
  • It can be hard for a computer to know what a "sentence" is. Can you assume that all sentences are separated by periods? This works well as long as you don't have any periods in titles or abbreviations. Somehow you need to be able to tell the computer what you want without expecting the computer to be able to read and understand English. – MrFlick Mar 03 '21 at 00:26
  • Sounds interesting! Perhaps [this](https://stackoverflow.com/a/47646957/8402369) will help you get started. It can find a sentence with a word in it, from an array of split sentences. You should be able to take the index before and after the sentence in that array to get your desired results. – marsnebulasoup Mar 03 '21 at 00:28
  • @MrFlick - yes, reasonable to assume sentences are separated by periods - the text are essentially news articles. – Klasic Mar 03 '21 at 00:33
  • @marsnebulasoup - yes I've read that one- thanks. The problem is - that one ONLY extracts the sentence containing the word. I want to extract sentences around it so I can get context (because in the example above, Jon Doe comes up in a sentence that doesn't use a keyword but is still related). – Klasic Mar 03 '21 at 00:34
  • Well, if you've already split your text into sentences per the other question, then you can find matches in a vector in a window using this existing answer: https://stackoverflow.com/questions/52047002/select-n-rows-above-and-below-match – MrFlick Mar 03 '21 at 00:36
  • @MrFlick - that has me completely lost I'm afraid, do you mind explaining further how to apply it? I used the text <- unlist(strsplit(text, "\\.")) which separates my sentences but from there I can't figure out the nesting/grabbing surrounding sentences – Klasic Mar 03 '21 at 00:41
  • @Klasic - that's what I meant. If you split your text into a list of sentences, and find the indices of the list that contain sentences with the words you're looking for (with my linked answer), you just have to take the sentence at `index - 1` and `index + 1` to get your surrounding sentences. Note that if the sentence found is at the first index (1 in R), then you have to handle that, because there wouldn't be any preceding sentence. The same goes for the last index of the list. – marsnebulasoup Mar 03 '21 at 00:41
  • @marsnebulasoup - ahhh ok. I think I understand now. I'm going to give that a try-- my indexing wasn't working for some reason. Thanks- will follow-up. – Klasic Mar 03 '21 at 00:44
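
The approach discussed in the comments above (split into sentences, find the indices that match, take the neighbouring indices) can be sketched in base R roughly as follows. This is only a sketch: it assumes every sentence ends with a period followed by whitespace, and the names `sentences`, `hits`, `window` and `idx` are illustrative, not from the thread.

```r
# Sample text and keywords from the question
text <- "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about."
keywords <- c("water quality", "health")

# Split on whitespace that follows a period, so each sentence keeps its period
# (assumption: periods only ever end sentences)
sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))

# Indices of the sentences that mention any keyword
hits <- grep(paste(keywords, collapse = "|"), sentences, ignore.case = TRUE)

# Number of sentences to keep on each side of a hit (hypothetical parameter)
window <- 1

# Expand each hit into a range of indices, clamped to 1..length(sentences)
# so the first and last sentences are handled safely
idx <- sort(unique(unlist(lapply(hits, function(i) {
  seq(max(1, i - window), min(length(sentences), i + window))
}))))

# The four sentences from "Then in the middle..." to "Jon Doe is a key player..."
sentences[idx]
```

Setting `window` to 2 or 3 widens the context; the clamping with `max()` and `min()` is what takes care of matches in the first or last sentence.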

2 Answers


Use the keywords to build a pattern to look for, put the data in a tibble, split it into one sentence per row (splitting on the period), and for every row n where the pattern is found keep rows n-1, n and n+1.

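library(dplyr)
library(tidyr)

# `text` is the sample string from the question
keywords <- c("water quality", "health")
# Combine the keywords into one alternation pattern
pat <- paste0(keywords, collapse = '|')
pat
#[1] "water quality|health"

tibble(text) %>%
  # One row per sentence; the period and any following whitespace are dropped
  separate_rows(text, sep = '\\.\\s*') %>%
  slice({
    # Row numbers of the sentences that contain a keyword
    tmp <- grep(pat, text, ignore.case = TRUE)
    # Keep each matching row plus the row before and the row after it
    sort(unique(c(tmp-1, tmp, tmp + 1)))
  })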

#  text                                                          
#  <chr>                                                         
#1 Then in the middle, there is a sentence that I want to extract
#2 Water quality is a serious concern in Akron, Ohio             
#3 It can impact ecological systems and human health             
#4 Jon Doe is a key player in this realm       
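
To repeat this over several rows of text, as the question asks, one possible extension is to group by a row identifier before picking the window. The sketch below is an assumption-laden illustration, not part of the answer above: the `docs` tibble, its `id` column, the helper column `pos` and the `window` variable are all made up for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: one document per row, identified by `id`
docs <- tibble(
  id = 1:2,
  text = c(
    "Intro sentence. Water quality matters here. Closing sentence. Unrelated tail.",
    "Nothing relevant. Still nothing. Human health is discussed. The end."
  )
)

pat <- "water quality|health"
window <- 1  # sentences to keep on each side of a match

context <- docs %>%
  # One row per sentence, still carrying the document id
  separate_rows(text, sep = "\\.\\s*") %>%
  # Drop any empty piece left after the final period
  filter(text != "") %>%
  group_by(id) %>%
  # Position of each sentence within its own document
  mutate(pos = row_number()) %>%
  filter({
    # Positions of the matching sentences in this document
    hits <- pos[grepl(pat, text, ignore.case = TRUE)]
    # Keep a sentence if it lies within `window` positions of any hit
    vapply(pos, function(p) any(abs(p - hits) <= window), logical(1))
  }) %>%
  ungroup() %>%
  select(-pos)

context
```

Grouping keeps the window from spilling across documents: the last sentence of one row is never treated as a neighbour of the first sentence of the next.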
Ronak Shah

This can be done with a regular expression.

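for(kw in keywords) {
    # Build one pattern per keyword: optional sentence before, keyword sentence, optional sentence after
    Pat <- paste(".*?(([^.]+\\.){0,1}[^.]+", kw, ".*?\\.(.*?\\.){0,1}).*", sep="")
    # Replace the whole text with the captured group
    print(sub(Pat, "\\1", text, ignore.case=TRUE))
}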
[1] " Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health."
[1] " Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm."

Some details about the regex. This works the same for each keyword. I will use the second one "health" as my example. If you print out the pattern Pat, you get

".*?(([^.]+\\.){0,1}[^.]+health.*?\\.(.*?\\.){0,1}).*"

What is this doing? The sub statement will replace whatever is matched with the contents of \1 - the first capture group, the stuff inside the first set of parentheses. Let's look at pieces of this.

To get the sentence containing the keyword "health" we have [^.]+health.*?\\. This matches one or more characters other than a period, followed by health, followed by as few characters as possible up to and including the next period.

To get the sentence after the sentence with health, we add (.*?\\.){0,1} That means any characters up to and including the next period. But what if there is no sentence after the health sentence? That is why I wrote {0,1}: it makes the next sentence optional. Similarly, we include ([^.]+\\.){0,1} in front of the part that captures the "health" sentence to get an optional sentence before the health sentence.

All of this is in parentheses to make it a capture group - the first capture group, the one that gets stored in \1. That matches the part that we want, but what about the rest? We want to get rid of everything else, so we put .*? in front and .* at the end so that the rest of the text will be matched. Now the pattern matches the entire string, and sub replaces it with the part that we want. If you want up to two sentences before and after the keyword sentence, just replace {0,1} with {0,2}.
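
As a rough sketch of that last point, the pattern can be built with the repeat count as a parameter. The function name `extract_context` and its argument `n` are hypothetical. This version also switches to perl = TRUE (my choice, not in the code above), returns NA when nothing matches instead of the unchanged text, and assumes the keyword contains no regex metacharacters.

```r
# Sample text from the question
text <- "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about."

# Hypothetical helper: keep up to `n` sentences on each side of the keyword sentence
extract_context <- function(text, keyword, n = 1) {
  # Same shape as the pattern explained above, with {0,1} generalised to {0,n}
  pat <- paste0(".*?(([^.]+\\.){0,", n, "}[^.]+", keyword,
                ".*?\\.(.*?\\.){0,", n, "}).*")
  # sub() returns its input unchanged on no match, so check first
  if (!grepl(pat, text, ignore.case = TRUE, perl = TRUE)) {
    return(NA_character_)
  }
  # Replace the whole text with capture group 1 and drop the leading space
  trimws(sub(pat, "\\1", text, ignore.case = TRUE, perl = TRUE))
}

extract_context(text, "water quality", n = 1)
extract_context(text, "health", n = 2)
```

For a column of texts, something like `sapply(df$text, extract_context, keyword = "health", n = 2)` would apply it row by row, where `df` stands for whatever data frame holds the articles.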

G5W
  • thank you so much for the explanation - this is great. I am working on implementing it now and will follow-up with questions. Seriously, thank you- been banging my head against the wall. – Klasic Mar 03 '21 at 01:50