Gather function in R dropping column

Question

I'm comparing the language used by several authors, using texts downloaded from Project Gutenberg, but I'm having trouble with my tibble manipulation. My end goal is a plot comparing the word frequencies of Herman Melville and Lewis Carroll against those of Washington Irving. However, my final tibble has no Irving column, so ggplot fails when I refer to it.

I'm expecting my frequency tibble to look like

# A tibble: 72,984 x 4
   word             Irving     author     proportion
   <chr>             <dbl>     <chr>        <dbl>
 1 a'dale         0.00000907   Melville   NA        
 2 aa             NA           Melville   0.0000246
 3 ab             NA           Melville   NA        
 4 aback          NA           Melville   0.0000369
 5 abana          NA           Melville   0.0000123
 6 abandon        0.0000363    Melville   0.0000861
 7 abandoned      0.000163     Melville   0.000172 
 8 abandoning     0.0000181    Melville   NA        
 9 abandonment    0.00000907   Melville   0.0000123
10 abasement      0.0000181    Melville   0.0000123
# ... with 72,974 more rows

but instead it looks like

# A tibble: 72,984 x 3
   word        author   proportion
   <chr>       <chr>         <dbl>
 1 a'dale      Melville NA        
 2 aa          Melville  0.0000246
 3 ab          Melville NA        
 4 aback       Melville  0.0000369
 5 abana       Melville  0.0000123
 6 abandon     Melville  0.0000861
 7 abandoned   Melville  0.000172 
 8 abandoning  Melville NA        
 9 abandonment Melville  0.0000123
10 abasement   Melville  0.0000123
# ... with 72,974 more rows

and I'm not sure what I'm doing wrong when I gather to make the frequency tibble.

Code

# Import libraries
library(tidyverse) # dplyr, tidyr, stringr, ggplot2
library(tidytext)
library(gutenbergr)
library(scales)    # percent_format(), used in the plot below

# Download four works from each author
wirving <- gutenberg_download(c(49872, 41, 14228, 13514)) 
hmelville <- gutenberg_download(c(15, 4045, 28656, 2694))
lcarroll <- gutenberg_download(c(19033, 620, 12, 4763))

# tidy each author
tidy_wirving <- wirving %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

tidy_hmelville <- hmelville %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

tidy_lcarroll <- lcarroll %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

# calculate word frequency
frequency_by_word_across_authors <- 
  bind_rows(mutate(tidy_wirving, author = "Irving"),
            mutate(tidy_hmelville, author = "Melville"),
            mutate(tidy_lcarroll, author = "Carroll")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n /sum(n)) %>%
  select(-n) %>%
  spread(author, proportion)

# compare frequency of Melville and Carroll against Irving
frequency <- frequency_by_word_across_authors %>%
  gather(author, proportion,`Melville`:`Carroll`)

ggplot(frequency,
       aes(x = proportion,
           y =`Irving`,
           color = abs(`Irving`- proportion))) +
  geom_abline(color = "gray40", 
              lty = 2) +
  geom_jitter(alpha = 0.1, 
              size = 2.5,
              width = 0.3, 
              height = 0.3) +
  geom_text(aes(label = word),
            check_overlap = TRUE, 
            vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4",
                       high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Washington Irving", x = NULL)

# Error in FUN(X[[i]], ...) : object 'Irving' not found

In general, please provide a complete and reproducible question: in the case here, be clear about which non-base packages you are using. I suspect `gutenbergr`, `dplyr`, `tidyr`, and `ggplot2`. (Since the question is not about `gutenbergr`, I suggest you remove the code that fetches data and instead provide a small sample of data, preferably using `dput(head(...))` on whichever data is strictly required to demonstrate plotting more than one author.) — r2evans, Oct 31 '19 at 16:09
I haven't run the code, but `y=\`Irving\`` seems like you are looking for a level within `frequency$author`. Are you intending to subset instead? — r2evans, Oct 31 '19 at 16:10
@r2evans apologies, added imports to my code. And no, y=`Irving` refers to the Irving column which existed in `frequency_by_word_across_authors` but disappears when I create `frequency`. — carousallie, Oct 31 '19 at 16:15
Have you tried something like `gather(author, proportion, -\`Irving\`)`? It seems you need to gather all others *except* `Irving`, and it's not clear to me in what order the columns are listed in the spread-frame. (I don't have `gutenbergr` or `tidytext` installed, so I cannot test anything without seeing one of `tidy_wirving` and friends.) — r2evans, Oct 31 '19 at 16:33
I'm not downloading that data, but probably you just need to replace `gather(author, proportion,Melville:Carroll)` with ` gather(author, proportion, Melville, Carroll)` — Axeman, Oct 31 '19 at 16:57

score 1 · Accepted Answer · answered Nov 02 '19 at 00:10

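The issue is how you are using gather(). After spread(), the author columns are in alphabetical order (word, Carroll, Irving, Melville), so the two columns you want to gather are not next to each other and the range `Melville`:`Carroll` also includes Irving. Don't use `:` here; name the two columns explicitly instead: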

frequency <- frequency_by_word_across_authors %>%
  gather(author, proportion, Carroll, Melville)


ggplot(frequency,
       aes(x = proportion,
           y = Irving,
           color = abs(Irving - proportion))) +
  geom_abline(color = "gray40", 
              lty = 2) +
  geom_jitter(alpha = 0.1, 
              size = 2.5,
              width = 0.3, 
              height = 0.3) +
  geom_text(aes(label = word),
            check_overlap = TRUE, 
            vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4",
                       high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Washington Irving", x = NULL)

Created on 2019-11-01 by the reprex package (v0.3.0)
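
To see why the range selection drops the column, here is a minimal sketch on made-up data. The `toy_long` frame and its values are hypothetical stand-ins for the real word proportions, and only `tidyr` is assumed to be installed. It shows the column order that spread() produces, what the `Melville:Carroll` range ends up gathering, and two ways of keeping Irving as its own column.

```r
library(tidyr)

# Hypothetical toy data standing in for the real per-author proportions
toy_long <- data.frame(
  author     = c("Irving", "Melville", "Carroll", "Melville", "Carroll"),
  word       = c("abandon", "abandon", "abandon", "aback", "alice"),
  proportion = c(0.3, 0.2, 0.1, 0.4, 0.5),
  stringsAsFactors = FALSE
)

# spread() creates one column per author, in alphabetical order
toy_wide <- spread(toy_long, author, proportion)
names(toy_wide)
# "word" "Carroll" "Irving" "Melville"

# The range Melville:Carroll covers every column between the two,
# so Irving is gathered as well and no longer exists as a column
wrong <- gather(toy_wide, author, proportion, Melville:Carroll)
names(wrong)
# "word" "author" "proportion"

# Naming the columns explicitly leaves Irving untouched
right <- gather(toy_wide, author, proportion, Carroll, Melville)
names(right)
# "word" "Irving" "author" "proportion"

# Alternative: exclude the columns that should stay in place
alt <- gather(toy_wide, author, proportion, -word, -Irving)
```

Listing the columns by name, or excluding the ones to keep, does not depend on column order, so it keeps working if another author is added later.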
