How Can I Bind and Graph Two Books for Similarity of Word Frequency?

Question

I am using Text Mining with R: A Tidy Approach by Julia Silge & David Robinson to bind and graph two books, the first by Jane Austen (Persuasion, called "persua" in the code), the second by Charlotte Bronte (Jane Eyre, called "janeyre" in the code), so that I can compare them by word frequency. Afterwards, I would like to add the complete corpus of 6 books by Austen and 4 books by Charlotte Bronte, because I am trying to understand the consistency over time of authorial idiolect.

I have tried to modify some of the code found in the first chapter of Silge & Robinson's book in order to do this, from the section on word frequency.

library(tidyverse)
library(dplyr)
library(gutenbergr)
library(tidytext)
library(stringr)
library(ggplot2)

persua <- gutenberg_download(c(105))
tidy_persua <- persua %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) 

library(dplyr)
library(gutenbergr)
library(tidytext)
janeyre<- gutenberg_download(c(1260))
tidy_janeyre <- janeyre%>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) 

library(tidyr)

frequency <- bind_rows(mutate(tidy_persua, author = "Jane Austen"),                              
                      (mutate(tidy_janeyre, author = "Charlotte Bronte")) %>%   
mutate(word = str_extract(word, "[a-z']+")) %>%
count(author, word) %>%
group_by(author) %>%
mutate(proportion = n / sum(n)) %>%

select(-n) %>% spread(author, proportion) %>% gather(author, proportion, Charlotte Bronte : Jane Austen)

library(scales)
ggplot(frequency, aes(x = proportion, y =  `Jane Austen`, 
                  color = abs(`Jane Austen` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), 
                   low = "darkslategray4", high = "gray75") +
facet_wrap(~author, ncol = 2) +
theme(legend.position="none") +
labs(y = "Jane Austen", x = NULL)

However, the last block of code produces a number of error messages that confuse me, because I do not have much experience working with R.

Okay. Here it is again. — Terence Murphy, Dec 04 '20 at 01:43

Andrew Brown · Accepted Answer · 2020-12-04T15:39:34.627

You do not really have a question or an example of what the expected output is supposed to be. And I am not sure what the best practices are for calculating proportions for the types of comparisons you are attempting here... but the following reprex shows something that appears to work, with relatively minor modifications of the code you posted.

My take calculates proportion as the number of occurrences of a word by a single author relative to the total number of times the two authors used that word.

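Notably, it does not look like you were getting proper word counts. I am using pivot_wider(values_fn = sum) to deal with some duplication that is present after calculating the proportions: once str_extract() has cleaned the tokens, the same word can appear in more than one row for a single author.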
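
As a rough illustration of that duplication, here is a minimal sketch on made-up counts. The `toy` table and its numbers are invented for this example and are not taken from either book. It assumes that a token such as `_anne_` (underscores marking italics in the plain text) is reduced to `anne` by `str_extract()`, so one author ends up with two rows for the same word, and `values_fn = sum` adds their shares back together.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical pre-counted tokens; "_anne_" stands in for an italicised word
toy <- tibble(
  word   = c("anne", "_anne_", "anne", "jane"),
  n      = c(5, 1, 4, 10),
  author = c("Jane Austen", "Jane Austen", "Charlotte Bronte", "Charlotte Bronte")
)

shares <- toy %>%
  # "_anne_" becomes "anne", which duplicates the (word, author) pair
  mutate(word = str_extract(word, "[a-z']+")) %>%
  # share of each row relative to all uses of that word by both authors
  group_by(word) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # one column per author; duplicated pairs are collapsed by summing
  pivot_wider(names_from = author, values_from = proportion, values_fn = sum)

print(shares)
```

In this toy case the two Austen rows for "anne" (0.5 and 0.1) are combined into a single share of 0.6, and a word used by only one author gets NA in the other author's column. Without `values_fn`, `pivot_wider()` would be expected to warn about non-unique values and return list-columns instead.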

library(tidyverse)
library(dplyr)
library(gutenbergr)
library(tidytext)
library(stringr)
library(tidyr)
library(ggplot2)
library(scales, warn.conflicts=FALSE)

persua <- gutenberg_download(c(105))
#> Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
#> Using mirror http://aleph.gutenberg.org

tidy_persua <- persua %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) 

janeyre <- gutenberg_download(c(1260))

tidy_janeyre <- janeyre %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) 

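frequency <- bind_rows(mutate(tidy_persua, author = "Jane Austen"),
                       mutate(tidy_janeyre, author = "Charlotte Bronte")) %>%
  # keep only letters and apostrophes (drops e.g. underscores around italics)
  mutate(word = str_extract(word, "[a-z']+")) %>%
  # share of each author's count relative to both authors' use of the word
  group_by(word) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  # one column per author; duplicated (word, author) rows are summed
  pivot_wider(names_from = author, values_from = proportion, values_fn = sum)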

ggplot(frequency, aes(x = `Charlotte Bronte`, y =  `Jane Austen`, 
                      color = abs(`Jane Austen` - `Charlotte Bronte`))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  theme(legend.position="none") +
  labs(y = "Jane Austen", x = "Charlotte Bronte")

EDIT: Here is a working version that is a direct clone of the https://www.tidytextmining.com/tidytext.html example -- but applied to your two books. It does not produce the most exciting looking graph.

frequency <- bind_rows(mutate(tidy_persua, author = "Jane Austen"),
                       mutate(tidy_janeyre, author = "Charlotte Bronte")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word, wt = n) %>%  # the tables are already counted, so sum n rather than tallying rows
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion) %>% 
  gather(author, proportion, `Charlotte Bronte`)

ggplot(frequency, aes(x = proportion, y = `Jane Austen`, 
                      color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Jane Austen", x = NULL)

Here is the link. If you scroll down to "Word Frequencies", you will be able to see what Silge and Robinson present. — Terence Murphy, Dec 04 '20 at 06:48
But this looks very good! Thank you very much! Let me try it out and see how it works. — Terence Murphy, Dec 04 '20 at 06:49
I am curious when you say that I wasn't getting accurate word counts. How can you tell? Is this because I have been using Gutenberg and inadvertently including the bumph at beginning and end of the documents? By the way, the code works beautifully! Thanks very much! — Terence Murphy, Dec 04 '20 at 07:42
I think you had `group_by(author)`, which was only counting the instances of word records, not the actual number of words. I think the raw tables, e.g. `tidy_persua`, were fine, but the group_by and count resulted in n being the number of records (unique words), not the number of words (total). Not sure if that makes sense, but the net result was that most words were only "occurring" once and the authors were 50/50 for tons of them. — Andrew Brown, Dec 04 '20 at 14:53
No, that does make sense. Thanks again for all your help. (By the way, can I "thank" you on StackOverflow? How does one do this?) — Terence Murphy, Dec 04 '20 at 21:57
If you mark my answer as "the" answer then I will get some reputation points! — Andrew Brown, Dec 09 '20 at 19:55
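
The difference between counting records and counting words, as discussed in the comments, can be sketched on a small made-up table. The `toy_counts` data and the column names `records` and `total` are invented for this illustration. It assumes a dplyr version in which `count()` without `wt` tallies rows, which matches the behaviour described above. Passing `wt = n` sums the existing counts instead.

```r
library(dplyr)

# Hypothetical table shaped like tidy_persua / tidy_janeyre after bind_rows():
# one row per (author, word) with a precomputed count n
toy_counts <- tibble(
  author = c("Jane Austen", "Jane Austen", "Charlotte Bronte", "Charlotte Bronte"),
  word   = c("the", "anne", "the", "jane"),
  n      = c(100, 20, 150, 30)
)

# Tallies rows: every (author, word) pair appears once, so every count is 1
by_rows <- toy_counts %>%
  count(author, word, name = "records")

# Sums the existing n column: this recovers the real word totals
by_words <- toy_counts %>%
  count(author, word, wt = n, name = "total")

# Per-author proportions computed from the real totals
weighted <- by_words %>%
  group_by(author) %>%
  mutate(proportion = total / sum(total)) %>%
  ungroup()

print(by_rows)
print(weighted)
```

With row tallies, each author's proportions come out as 1 divided by the number of distinct words, which is why most points would land on top of each other in the plot. With `wt = n`, the proportions reflect how often each author actually used the word.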
