
Suppose I have a few sentences describing how John spends his days stored in a dataframe in R:

library(dplyr)

df <- data_frame(sentence = c("John went to work this morning", "John likes to jog", "John is hungry"))

I want to identify which words appear most often in sentences that contain "John". I can use unnest_tokens() to extract consecutive words, but how can I identify recurring pairings that are not consecutive?

The goal is to obtain a result that counts how many times every other word appears in a sentence together with "John":

df2 <- data_frame(word1 = c("John", "John", "John", "John", "John", "John", "John", "John", "John"),
                 word2 = c("went", "to", "work", "this", "morning", "likes", "jog", "is", "hungry"),
                 n = c(1, 2, 1, 1, 1, 1, 1, 1, 1))
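
For context, the consecutive-word counting mentioned above could be done like this (a minimal sketch assuming the tidytext package; the column name bigram is only illustrative):

library(dplyr)
library(tidytext)

# adjacent word pairs only: token = "ngrams" with n = 2
df |>
  unnest_tokens(bigram, sentence, token = "ngrams", n = 2, to_lower = FALSE) |>
  count(bigram, sort = TRUE)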
  • In the context of more data, you would perhaps be thinking of [bigram/n-gram](https://www.r-bloggers.com/2019/08/how-to-create-unigrams-bigrams-and-n-grams-of-app-reviews/) processing. – Chris Aug 24 '22 at 20:11
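
A sketch of the n-gram route the comment points at: skip-grams pair each word with words up to k positions away, so non-adjacent pairings are captured (this assumes the tidytext and tokenizers packages; the n, n_min and k values are only illustrative):

library(dplyr)
library(tidytext)

# skip-grams: word pairs separated by up to k other words
df |>
  unnest_tokens(pair, sentence, token = "skip_ngrams", n = 2, n_min = 2, k = 4, to_lower = FALSE) |>
  filter(grepl("^John ", pair)) |>   # keep pairs that start with "John"
  count(pair, sort = TRUE)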

1 Answer

We can try:
library(dplyr)

# split each sentence into words, then pair the first word with every remaining word
lst <- lapply(strsplit(df$sentence, " "), \(x) list(x[1], x[-1])) |>
       lapply(\(x) data.frame(x[1], x[2]))

# standardise the column names, stack the per-sentence data frames and count each pair
ans <- lapply(lst, \(x) {colnames(x) <- c("word1", "word2"); x}) |>
       do.call(rbind, args = _) |>
       group_by(word1, word2) |>
       summarise(n = n())

Output:
# A tibble: 9 × 3
# Groups:   word1 [1]
  word1 word2       n
  <chr> <chr>   <int>
1 John  hungry      1
2 John  is          1
3 John  jog         1
4 John  likes       1
5 John  morning     1
6 John  this        1
7 John  to          2
8 John  went        1
9 John  work        1
– Mohamed Desouky
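
The answer pairs each sentence's first word with every remaining word, which works here because "John" always opens the sentence. For comparison, an equivalent dplyr/tidyr sketch (not taken from the answer, just a rewrite of the same idea):

library(dplyr)
library(tidyr)

# first word of each sentence vs. the remaining words
df |>
  mutate(word1 = sub(" .*", "", sentence),       # first word ("John")
         word2 = sub("^\\S+ ", "", sentence)) |> # everything after the first word
  separate_rows(word2, sep = " ") |>
  count(word1, word2)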