I'm comparing the language used by several authors, using texts downloaded from Project Gutenberg, but I'm having trouble with my tibble manipulation. My end goal is a plot comparing how often Herman Melville and Lewis Carroll use each word against how often Washington Irving does. However, my tibble has no Irving column, so ggplot fails when I refer to it.

I'm expecting my frequency tibble to look like

# A tibble: 72,984 x 4
   word        Irving     author   proportion
   <chr>       <dbl>      <chr>         <dbl>
 1 a'dale      0.00000907 Melville NA
 2 aa          NA         Melville  0.0000246
 3 ab          NA         Melville NA
 4 aback       NA         Melville  0.0000369
 5 abana       NA         Melville  0.0000123
 6 abandon     0.0000363  Melville  0.0000861
 7 abandoned   0.000163   Melville  0.000172
 8 abandoning  0.0000181  Melville NA
 9 abandonment 0.00000907 Melville  0.0000123
10 abasement   0.0000181  Melville  0.0000123
# ... with 72,974 more rows

but instead it looks like

# A tibble: 72,984 x 3
   word        author   proportion
   <chr>       <chr>         <dbl>
 1 a'dale      Melville NA        
 2 aa          Melville  0.0000246
 3 ab          Melville NA        
 4 aback       Melville  0.0000369
 5 abana       Melville  0.0000123
 6 abandon     Melville  0.0000861
 7 abandoned   Melville  0.000172 
 8 abandoning  Melville NA        
 9 abandonment Melville  0.0000123
10 abasement   Melville  0.0000123
# ... with 72,974 more rows

and I'm not sure what I'm doing wrong when I gather to make the frequency tibble.

Code

# Import libraries
library(tidyverse) # dplyr, tidyr, stringr, ggplot2
library(tidytext)
library(gutenbergr)
library(scales) # percent_format() used in the plot axes

# Download four works from each author
wirving <- gutenberg_download(c(49872, 41, 14228, 13514)) 
hmelville <- gutenberg_download(c(15, 4045, 28656, 2694))
lcarroll <- gutenberg_download(c(19033, 620, 12, 4763))

# tidy each author
tidy_wirving <- wirving %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

tidy_hmelville <- hmelville %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

tidy_lcarroll <- lcarroll %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

# calculate word frequency
frequency_by_word_across_authors <- 
  bind_rows(mutate(tidy_wirving, author = "Irving"),
            mutate(tidy_hmelville, author = "Melville"),
            mutate(tidy_lcarroll, author = "Carroll")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n /sum(n)) %>%
  select(-n) %>%
  spread(author, proportion)

# compare frequency of Melville and Carroll against Irving
frequency <- frequency_by_word_across_authors %>%
  gather(author, proportion,`Melville`:`Carroll`)

ggplot(frequency,
       aes(x = proportion,
           y =`Irving`,
           color = abs(`Irving`- proportion))) +
  geom_abline(color = "gray40", 
              lty = 2) +
  geom_jitter(alpha = 0.1, 
              size = 2.5,
              width = 0.3, 
              height = 0.3) +
  geom_text(aes(label = word),
            check_overlap = TRUE, 
            vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4",
                       high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Washington Irving", x = NULL)

# Error in FUN(X[[i]], ...) : object 'Irving' not found
r2evans
carousallie
  • In general, please provide a complete and reproducible question: in this case, be clear about which non-base packages you are using. I suspect `gutenbergr`, `dplyr`, `tidyr`, and `ggplot2`. (Since the question is not about `gutenbergr`, I suggest you remove the code that fetches data and instead provide a small sample of data, preferably using `dput(head(...))` on whichever data is strictly required to demonstrate plotting more than one author.) – r2evans Oct 31 '19 at 16:09
  • I haven't run the code, but y=`Irving` looks like you are trying to refer to a level within `frequency$author`. Are you intending to subset instead? – r2evans Oct 31 '19 at 16:10
  • @r2evans apologies, added imports to my code. And no, y=`Irving` refers to the Irving column which existed in `frequency_by_word_across_authors` but disappears when I create `frequency`. – carousallie Oct 31 '19 at 16:15
  • Have you tried something like ``gather(author, proportion, -`Irving`)``? It seems you need to gather all others *except* `Irving`, and it's not clear to me in what order the columns are listed in the spread-frame. (I don't have `gutenbergr` or `tidytext` installed, so I cannot test anything without seeing one of `tidy_wirving` and friends.) – r2evans Oct 31 '19 at 16:33
  • I'm not downloading that data, but probably you just need to replace `gather(author, proportion, Melville:Carroll)` with `gather(author, proportion, Melville, Carroll)` – Axeman Oct 31 '19 at 16:57

1 Answer

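The issue is how you are using `gather()`. `spread()` puts the author columns in alphabetical order (Carroll, Irving, Melville), so the range `Melville:Carroll` covers every column between those two, Irving included, and Irving gets gathered along with the others. The two columns you want to gather are not next to each other, so name them explicitly instead of using `:`: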

frequency <- frequency_by_word_across_authors %>%
  gather(author, proportion, Carroll, Melville)


ggplot(frequency,
       aes(x = proportion,
           y = Irving,
           color = abs(Irving - proportion))) +
  geom_abline(color = "gray40", 
              lty = 2) +
  geom_jitter(alpha = 0.1, 
              size = 2.5,
              width = 0.3, 
              height = 0.3) +
  geom_text(aes(label = word),
            check_overlap = TRUE, 
            vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4",
                       high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Washington Irving", x = NULL)

Created on 2019-11-01 by the reprex package (v0.3.0)
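
To see why the range selection swallows the Irving column, here is a minimal sketch on a toy tibble. The data and the names `toy`, `wide`, `bad`, `good`, and `good_pivot` are made up for illustration and are not part of the original code. It also shows `pivot_longer()`, which does the same job as `gather()` but assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the per-author proportions (values are invented)
toy <- tibble(
  author     = c("Irving", "Melville", "Carroll", "Irving", "Melville"),
  word       = c("abandon", "abandon", "abandon", "aback", "aback"),
  proportion = c(0.3, 0.2, 0.1, 0.05, 0.4)
)

# spread() creates one column per author, in sorted order
wide <- toy %>% spread(author, proportion)
names(wide)
# "word" "Carroll" "Irving" "Melville"

# A range covers everything between its endpoints, so Irving is gathered too
bad <- wide %>% gather(author, proportion, Melville:Carroll)
names(bad)   # no Irving column left

# Naming the columns explicitly leaves Irving in place
good <- wide %>% gather(author, proportion, Carroll, Melville)
names(good)  # Irving survives as its own column

# Equivalent reshaping with pivot_longer() (assumes tidyr >= 1.0.0)
good_pivot <- wide %>%
  pivot_longer(c(Carroll, Melville),
               names_to = "author",
               values_to = "proportion")
```

Either `good` or `good_pivot` has the shape the plot expects: one `Irving` column to put on the y axis, plus `author` and `proportion` for the facets and the x axis.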

Julia Silge