
I'm new to R, and I'm using widyr to do text mining. I successfully used the methods found here to get a list of co-occurring words within each section of text and their phi coefficient.

Code as follows:

word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)

I understand that I can also generate a data frame with co-occurring words and the number of times they appear, using code like:

word_pairs <- review_words %>%
  pairwise_count(word, title, sort = TRUE)

What I need is a table that has both the phi coefficient and the number of occurrences for each pair of words. I've been digging into pairwise_cor and pairwise_count but still can't figure out how to combine them. If I understand correctly, joins only take one column into account for matching, so I couldn't use a regular join reliably since there may be multiple pairs that have the same word in the item1 column.

Is this possible using widyr? If not, is there another package that will allow me to do this?

Here is the full code:

#Load packages
pacman::p_load(XML, dplyr, stringr, rvest, httr, xml2, tidytext, tidyverse, widyr)

#Load source material
prod_reviews_df <- read_csv("SOURCE SPREADSHEET.csv")

#Split into one word per row
review_words <- prod_reviews_df %>%
  unnest_tokens(word, comments, token = "words", format = "text", drop = FALSE) %>%
  anti_join(stop_words, by = "word")

#Find phi coefficient
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)

#Write data to CSV
write.csv(word_cors, "WORD CORRELATIONS.csv")

I want to add in pairwise_count, but I need it alongside the phi coefficient.

Thank you!

ElizabethW
  • I'm confused because you are using different data for the examples, but the question sounded like you wanted to get two statistics from the same data. Can you please clarify? Also, joins are not limited to one column, although I guess it could depend on the package you use. – Elin Sep 19 '17 at 23:38
  • Hi Elin, sorry for the confusion. I am not actually using the pairwise_count function in my code, so I just copied and pasted a pairwise_count example from the instructions I was using. I want to add it to my code, but only as a column attached to the word pairs and phi coefficient, which I am getting from the pairwise_cor function. I can't figure out how to do that and haven't been able to find any instructions. I will edit my post for clarity. Also, the joins I was looking at are from dplyr. I will look into other packages. – ElizabethW Sep 20 '17 at 17:32

2 Answers

If you are getting into using tidy data principles and tidyverse tools, I would suggest GOING ALL THE WAY :) and using dplyr to do the joins you are interested in. You can use left_join to connect the calculations from pairwise_cor() and pairwise_count(), and you can just pipe from one to the other, if you like.

library(dplyr)
library(tidytext)
library(janeaustenr)
library(widyr)

austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)

austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE) %>%
  left_join(austen_section_words %>%
              pairwise_count(word, section, sort = TRUE),
            by = c("item1", "item2"))

#> # A tibble: 154,842 x 4
#>        item1     item2 correlation     n
#>        <chr>     <chr>       <dbl> <dbl>
#>  1    bourgh        de   0.9508501    29
#>  2        de    bourgh   0.9508501    29
#>  3    pounds  thousand   0.7005808    17
#>  4  thousand    pounds   0.7005808    17
#>  5   william       sir   0.6644719    31
#>  6       sir   william   0.6644719    31
#>  7 catherine      lady   0.6633048    82
#>  8      lady catherine   0.6633048    82
#>  9   forster   colonel   0.6220950    27
#> 10   colonel   forster   0.6220950    27
#> # ... with 154,832 more rows
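
The join works because by = c("item1", "item2") matches on both columns at once, so pairs that share a word in item1 are still told apart. Here is a minimal sketch with invented values; the two toy tables are hypothetical stand-ins for the pairwise_cor() and pairwise_count() results, not real output:

```r
library(dplyr)

# Hypothetical stand-in for pairwise_cor() output (values are made up)
cors <- tibble(
  item1 = c("lady", "lady", "sir"),
  item2 = c("catherine", "bourgh", "william"),
  correlation = c(0.66, 0.30, 0.66)
)

# Hypothetical stand-in for pairwise_count() output (values are made up)
counts <- tibble(
  item1 = c("lady", "lady", "sir", "de"),
  item2 = c("bourgh", "catherine", "william", "bourgh"),
  n = c(12, 82, 31, 29)
)

# Match on both columns: "lady" appears twice in item1,
# but each (item1, item2) pair still picks up its own count
joined <- left_join(cors, counts, by = c("item1", "item2"))
joined
```

Because left_join() keeps every row of the left table, any pair that has a correlation but no matching count would show up with NA in n rather than being dropped.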
Julia Silge

I discovered and used merge today, and it appears to have used both relevant columns to merge the data. I'm not sure how to check for accuracy, but I think it worked.
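
One possible way to check the result, sketched on made-up data (the toy tables and their values are assumptions, not the real review data): base R merge() joins on all columns the two data frames have in common unless by is given, so naming by explicitly makes the key obvious. Comparing row counts and looking for missing values then gives a quick sanity check.

```r
# Hypothetical stand-ins for the pairwise_cor() and pairwise_count() results
cors <- data.frame(
  item1 = c("lady", "lady", "sir"),
  item2 = c("catherine", "bourgh", "william"),
  correlation = c(0.66, 0.30, 0.66),
  stringsAsFactors = FALSE
)
counts <- data.frame(
  item1 = c("lady", "lady", "sir"),
  item2 = c("bourgh", "catherine", "william"),
  n = c(12, 82, 31),
  stringsAsFactors = FALSE
)

# Name the key columns explicitly; all.x = TRUE keeps every correlation row
merged <- merge(cors, counts, by = c("item1", "item2"), all.x = TRUE)

# Sanity checks: no rows gained or lost, and no pair left without a count
nrow(merged) == nrow(cors)
anyNA(merged$n)

# A key that is not unique in the count table would duplicate rows in the result
any(duplicated(counts[c("item1", "item2")]))
```

Spot-checking a few pairs by hand against the original pairwise_count() output is another simple check.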

ElizabethW