Find close words in many articles in R

Question

I have a tibble (mydf) with 100 rows and 5 columns. Each article is made up of many paragraphs.

library(tibble)

ID<-c(1,2)
Date<-c("31/01/2018","15/02/2018")

article1<-c("This is the first article. It is not long. It is not short. It 
comprises of many words and many sentences. This ends paragraph one.  
Parapraph two starts here. It is just a continuation.")

article2<-c("This is the second article. It is longer than first article by 
number of words. It also does not communicate anyything of value. Reading it 
can put you to sleep or jumpstart your imagination. Let your imagination 
take you to some magical place. Enjoy the ride.")

Articles<-c(article1,article2)

FirstWord<-c("first","starts")
SecondWord<-c("jumpstart","magical")

mydf<-tibble(ID,Date, FirstWord,SecondWord,Articles)

ID    Date    FirstWord    SecondWord    Articles
 1    xxxx     xxx           xxx          xxx
 2     etc
 3     etc

I want to add a new column to the table that is TRUE if FirstWord occurs within 30 words of SecondWord in the article, and FALSE otherwise.

ID    Date    FirstWord    SecondWord    Articles   distance
 1    xxxx     xxx           xxx          xxx        TRUE
 2     etc                                           FALSE
 3     etc

I followed this Stack Overflow example to calculate distances: "How to calculate proximity of words to a specific term in a document".

library(tidytext)
library(dplyr)

all_words <- mydf %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())

library(fuzzyjoin)

nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position))

I get a table with these columns:

  focus_term   focus_position  ID    Date    FirstWord    SecondWord   word  position

How do I get the results in this format?

ID    Date    FirstWord    SecondWord    Articles   distance
 1    xxxx     xxx           xxx          xxx        TRUE
 2     etc                                           FALSE
 3     etc

Appreciate your help :)

Can you explain a bit more? Are you not getting columns such as Articles, FirstWord, SecondWord, etc. in the resulting data frame? — Mustufain, Feb 20 '18 at 11:35
@Mustufain - Yes, when I tokenise the **Articles** into **words**, the resulting table excludes the **Articles** column. I want to include this column and the distance. — Beginner, Feb 20 '18 at 12:41
You should probably provide a reproducible example of your data. Run `dput(head(mydf))` and post the result here. — csgroen, Feb 20 '18 at 12:57
@Beginner Since I cannot reproduce your example on my end, one thing you can try is copying the column with `mutate(new_col = Articles)` before tokenising the articles into words. Tokenising probably replaces the original Articles column with the words, which is why you cannot recover it. — Mustufain, Feb 20 '18 at 13:50
@Mustufain - I have added a reproducible example herein. Look forward to your wonderful help. — Beginner, Feb 21 '18 at 04:18

score 2 · Accepted Answer · answered Feb 21 '18 at 08:14

Because you are tokenizing the Articles column, it is transformed into the word column. To keep the original Articles column, copy it to a new column (say, new_column) with mutate before tokenizing. In nearby_words I have selected only the columns you want in the output. I have also replaced distance with a boolean that indicates whether the distance is equal to 30.

library(dplyr)
library(tidytext)
library(fuzzyjoin)

mydf <- tibble(ID, Date, FirstWord, SecondWord, Articles)

# Copy Articles first, because unnest_tokens() drops the column it tokenizes
all_words <- mydf %>%
  mutate(new_column = Articles) %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())

nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position)) %>%
  # TRUE only when the distance is exactly 30
  mutate(distance = ifelse(distance == 30, yes = TRUE, no = FALSE)) %>%
  select(ID, Date, FirstWord, SecondWord, new_column, distance)
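
One thing to watch in the pipeline above: position is assigned with row_number() over the whole token table, so the numbering runs on from one article into the next, and a word near the start of article 2 can appear "close" to a word at the end of article 1. The following is a minimal base-R sketch of restarting the position within each article. The tokens table here is made-up illustration data, not output from the code above.

```r
# Made-up token table: one row per word, in the shape unnest_tokens() produces.
tokens <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  word = c("this", "is", "first", "second", "article"),
  stringsAsFactors = FALSE
)

# Position over the whole table (what an ungrouped row_number() gives).
tokens$global_position <- seq_len(nrow(tokens))

# Position restarted within each article, so gaps never span two articles.
tokens$position <- ave(seq_len(nrow(tokens)), tokens$ID, FUN = seq_along)
```

In a dplyr pipeline the same effect should come from adding group_by(ID) before mutate(position = row_number()), followed by ungroup().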

The above finds the position of FirstWord and calculates its difference from every nearby word. How do I find the difference between the FirstWord position and the SecondWord position, and then output select(ID,Date,FirstWord,SecondWord,new_column,distance)? Thank you — Beginner, Mar 26 '18 at 08:35
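
One possible way to approach this follow-up, sketched in base R rather than with the tidytext/fuzzyjoin pipeline from the answer: tokenize each article on its own, look up the positions of both words, and compare the smallest gap against the threshold. The helper name words_within, the simple regex tokenizer, and the demo data are all assumptions made for illustration. The tokenizer in particular is cruder than unnest_tokens() and may split some words differently.

```r
# Hypothetical helper (not from the thread): TRUE when `first` and `second`
# both occur in `text` and some pair of occurrences is at most `max_dist`
# word positions apart.
words_within <- function(text, first, second, max_dist = 30) {
  # Lower-case, then split on anything that is not a letter, digit or apostrophe.
  tokens <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  tokens <- tokens[tokens != ""]
  pos_first  <- which(tokens == tolower(first))
  pos_second <- which(tokens == tolower(second))
  # If either word is missing there is no distance to measure.
  if (length(pos_first) == 0 || length(pos_second) == 0) return(FALSE)
  # Smallest gap over every pair of occurrences.
  min(abs(outer(pos_first, pos_second, "-"))) <= max_dist
}

# Small made-up data in the same shape as mydf.
demo <- data.frame(
  ID         = 1:2,
  FirstWord  = c("first", "second"),
  SecondWord = c("short", "missing"),
  Articles   = c("This is the first article. It is not long. It is not short.",
                 "This is the second article. It has nothing else to say."),
  stringsAsFactors = FALSE
)

# Apply the helper row by row; the original columns are left untouched.
demo$distance <- mapply(words_within, demo$Articles, demo$FirstWord,
                        demo$SecondWord, USE.NAMES = FALSE)
```

Because each article is tokenized separately, the Articles column is never dropped and positions never cross article boundaries. Rows where either word is absent come out as FALSE in this sketch; returning NA instead would be an equally reasonable choice.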
