Get context around extracted word

Question

I have extracted keywords from a dataframe of sentences. I need to get a few words pre- and post- keyword to understand the context and be able to do some basic counts.

I have tried multiple stringr and stringi functions, as well as grepl approaches others suggested on SO for similar questions, but I haven't found anything that works for my situation.

Below is what I'd like. Assume it is a dataframe or tibble with the first two fields listed. I need/want to create the rightmost column (keyword_w_context).

In the example, I'm pulling the three words that precede the keyword. But I want to be able to adjust the solution so I can get 1, 2, ... n words. It would also be nice if I could get the words after the keyword in the same way.

Basically, I want to do something like a mutate that creates a new variable with the context words (before/after, see below) around the keyword.

Sentence	Keyword	Keyword_w_context
The yellow lab dog is so cute.	dog	The yellow lab dog
The fluffy black cat purrs loudly.	cat	The fluffy black cat

Many thanks!

score 3 · Accepted Answer · 2021-03-09T22:16:01.217

You probably want to take a natural language processing (NLP) approach rather than something based on regular expressions. There are many frameworks for this; tidytext is one of the easier ones. Here is an example of how to grab the words surrounding your keywords.

You will probably want to play around with this to get what you want. It sounds like you want several things out of this, so I just picked one.

library(tidytext)
library(dplyr)
library(tibble)

df <- tibble(Sentence = c("The yellow lab dog is so cute.",
                          "The fluffy black cat purrs loudly."))
keywords <- tibble(word = c("dog", "cat"), keyword = TRUE)

df %>% 
  rowid_to_column() %>% 
  unnest_tokens("trigram", Sentence, token = "ngrams", n = 3, n_min = 2) %>%
  unnest_tokens("word", trigram, drop = FALSE) %>% 
  left_join(keywords, by = "word") %>% 
  filter(keyword)

# A tibble: 10 x 4
   rowid trigram          word  keyword
   <int> <chr>            <chr> <lgl>  
 1     1 yellow lab dog   dog   TRUE   
 2     1 lab dog          dog   TRUE   
 3     1 lab dog is       dog   TRUE   
 4     1 dog is           dog   TRUE   
 5     1 dog is so        dog   TRUE   
 6     2 fluffy black cat cat   TRUE   
 7     2 black cat        cat   TRUE   
 8     2 black cat purrs  cat   TRUE   
 9     2 cat purrs        cat   TRUE   
10     2 cat purrs loudly cat   TRUE

Here is one way to build on this. It tracks which sentence each n-gram came from and the position of each word within its n-gram, so you can filter to the rows where the keyword is the first word (word_pos == 1) or any other position.

df %>% 
  rowid_to_column("sentence_id") %>% 
  unnest_tokens("trigram", Sentence, token = "ngrams", n = 3, n_min = 3) %>%
  rowid_to_column("trigram_id") %>% 
  unnest_tokens("word", trigram, drop = FALSE) %>% 
  group_by(trigram_id) %>% 
  mutate(word_pos = row_number()) %>% 
  left_join(keywords, by = "word") %>%
  relocate(sentence_id, trigram_id, word_pos, trigram, word) %>% 
  filter(keyword, word_pos == 1)

# A tibble: 2 x 6
# Groups:   trigram_id [2]
  sentence_id trigram_id word_pos trigram          word  keyword
        <int>      <int>    <int> <chr>            <chr> <lgl>  
1           1          4        1 dog is so        dog   TRUE   
2           2          9        1 cat purrs loudly cat   TRUE
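
If the goal is specifically the Keyword_w_context column from the question, the same word-position idea can be sketched without n-grams. The following is a minimal base R sketch, not part of tidytext: keyword_context() is a hypothetical helper name, and it assumes words are separated by whitespace, that only the first occurrence of the keyword matters, and that matching should ignore case.

```r
# Hypothetical helper (not from tidytext): return up to `n_before` words
# before and `n_after` words after the first occurrence of `keyword`.
keyword_context <- function(sentence, keyword, n_before = 3, n_after = 0) {
  # Split on whitespace, then strip leading/trailing punctuation from each token
  words <- strsplit(sentence, "\\s+")[[1]]
  words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
  words <- words[nzchar(words)]

  # Case-insensitive match on whole tokens; only the first hit is used
  pos <- match(tolower(keyword), tolower(words))
  if (is.na(pos)) return(NA_character_)

  # Clamp the window to the sentence boundaries
  from <- max(1, pos - n_before)
  to   <- min(length(words), pos + n_after)
  paste(words[from:to], collapse = " ")
}

df <- data.frame(
  Sentence = c("The yellow lab dog is so cute.",
               "The fluffy black cat purrs loudly."),
  Keyword  = c("dog", "cat"),
  stringsAsFactors = FALSE
)

# mapply() applies the helper to each Sentence/Keyword pair
df$Keyword_w_context <- mapply(keyword_context, df$Sentence, df$Keyword,
                               MoreArgs = list(n_before = 3),
                               USE.NAMES = FALSE)
df$Keyword_w_context
# [1] "The yellow lab dog"   "The fluffy black cat"
```

Changing n_before and n_after adjusts the window. Because the window is clamped, a sentence with fewer words than requested returns what is available rather than NA. The mapply() call should also work inside a dplyr mutate() if you prefer to stay in a pipeline.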

Thank you, Adam. I hadn't thought of a tidytext approach. This works perfectly! – Brian Head Mar 10 '21 at 14:23

score 0 · Answer 2 · jsv · 2021-03-09T22:09:32.917

library(dplyr)    # %>% and mutate()
library(stringr)  # str_extract()

dat = read.table(text = 'Sentence   | Keyword | Keyword_w_context
The yellow lab dog is so cute.|dog|The yellow lab dog
The fluffy black cat purrs loudly.|cat|The fluffy black cat',sep="|",header=TRUE)

n_before = 3
n_after = 2

# Note: str_extract() returns NA for a row if there are not enough words before or after the keyword
dat %>% 
  mutate(Keyword_w_context_before = str_extract(string=Sentence,
                                              pattern=paste0("(([A-Za-z]+)\\s){",n_before,"}",Keyword)),
         
         Keyword_w_context_after = str_extract(string=Sentence,
                                               pattern=paste0(Keyword,"(\\s([A-Za-z]+)){",n_after,"}"))
         )


                            Sentence Keyword    Keyword_w_context Keyword_w_context_before Keyword_w_context_after
1     The yellow lab dog is so cute.     dog   The yellow lab dog       The yellow lab dog               dog is so
2 The fluffy black cat purrs loudly.     cat The fluffy black cat     The fluffy black cat        cat purrs loudly
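
One possible reason a fixed-count pattern such as {3} matches some rows and not others is that it needs exactly that many letter-only words next to the keyword, so shorter sentences, or words containing digits or apostrophes, give NA. The sketch below is a hedged base R variant of the same regex idea, not a drop-in fix for the poster's data. context_regex() is a hypothetical helper name; it assumes a {0,n} quantifier ("up to n words") is acceptable and that the keyword should be matched literally as a whole word.

```r
# Hypothetical helper (base R only): extract up to n_before / n_after words
# around `keyword` with a single regular expression.
context_regex <- function(sentence, keyword, n_before = 3, n_after = 0) {
  pattern <- paste0(
    "(\\w+\\W+){0,", n_before, "}",  # up to n_before preceding words
    "\\b\\Q", keyword, "\\E\\b",     # keyword as a literal whole word (PCRE \Q...\E)
    "(\\W+\\w+){0,", n_after, "}"    # up to n_after following words
  )
  m <- regexpr(pattern, sentence, perl = TRUE)
  # regexpr() returns -1 when there is no match
  if (m == -1) NA_character_ else regmatches(sentence, m)
}

sentence <- "The yellow lab dog is so cute."
context_regex(sentence, "dog", n_before = 3)               # "The yellow lab dog"
context_regex(sentence, "dog", n_before = 5)               # "The yellow lab dog"
context_regex(sentence, "dog", n_before = 1, n_after = 3)  # "lab dog is so cute"
context_regex(sentence, "bird")                            # NA
```

The helper handles one sentence at a time; for a whole data frame it could be applied row by row, for example with mapply(). Matching here is case-sensitive, so "Dog" would not match "dog" unless both are lowercased first.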

edited Mar 09 '21 at 22:09

answered Mar 09 '21 at 21:38


Thank you, jvargh7. This does match what is in the table. But, as I noted in the text of my question, I need to be able to modify it to be 2 words before, 3 words before, etc. of the keyword. Right now, this is pulling everything before the keyword. Is there an easy edit to make that happen? – Brian Head Mar 09 '21 at 21:45
Could you try now? Change the 'n' for different results – jsv Mar 09 '21 at 21:59
Thank you again, jvargh7. I appreciate you taking the time to try to help me find a solution. I wasn't able to get the code to work. It worked on some rows, but not others. I didn't follow why. Again, I appreciate it. – Brian Head Mar 10 '21 at 14:23
No problem @BrianHead. Adam's solution is much better. – jsv Mar 10 '21 at 19:01
