This is a bizarre puzzle. I downloaded two texts with gutenbergr: Alice's Adventures in Wonderland and Ulysses. After removing stop words, they disappear from Alice but are still present in Ulysses. The issue persists even when I replace anti_join with filter(!word %in% stop_words$word).

How do I get the stop_words out of Ulysses?

Thanks for your help!

Plot of top 15 tf_idf for Alice & Ulysses

library(gutenbergr)
library(dplyr)
library(stringr)
library(tidytext)
library(ggplot2)

titles <- c("Alice's Adventures in Wonderland", "Ulysses")


books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = c("title", "author"))


data(stop_words)


tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(title, word, sort=TRUE) %>%
  ungroup()


plot_tidy_books <- tidy_books %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf))       %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  mutate(title = factor(title, levels = unique(title)))


plot_tidy_books %>%
  group_by(title) %>%
  arrange(desc(n))%>%
  top_n(15, tf_idf) %>%
  mutate(word=reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill=title)) +
  geom_col(show.legend = FALSE) +
  labs(x=NULL, y="tf-idf") +
  facet_wrap(~title, ncol=2, scales="free") +
  coord_flip()

1 Answer

After a bit of digging in the tokenized Ulysses, it turns out that words like "it’s" use a right single quotation mark (’) instead of an apostrophe ('). stop_words in tidytext uses the apostrophe, so those tokens never match. You have to replace the right single quotation mark with an apostrophe before removing stop words.

I found this out by:

> utf8ToInt('it’s')
[1]  105  116 8217  115 
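
To make the mismatch concrete, here is a minimal base-R sketch. The variable names (curly, straight, normalized) are only for illustration and are not part of the original code; it compares the two characters and shows that replacing one with the other makes the strings identical.

```r
# "it’s" as it appears in the Ulysses text (right single quotation mark, U+2019)
curly    <- "it\u2019s"
# "it's" as it appears in stop_words (ASCII apostrophe)
straight <- "it's"

utf8ToInt("\u2019")  # code point of the right single quotation mark
utf8ToInt("'")       # code point of the ASCII apostrophe

# The two strings look alike but are not equal, so a join on them finds no match
identical(curly, straight)

# Replace the curly quote with a plain apostrophe (fixed = TRUE: literal match, no regex)
normalized <- gsub("\u2019", "'", curly, fixed = TRUE)
identical(normalized, straight)
```

The first identical() call should return FALSE and the second TRUE, which is why the stop words only match after the replacement.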

Googling 8217 led me to the Unicode entry for the right single quotation mark (U+2019). From there it's as easy as grabbing the C/C++/Java source escape \u2019, which R string literals also accept, and adding a mutate and gsub step before your anti_join.

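tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  # replace the right single quotation mark (U+2019) with a plain apostrophe
  mutate(word = gsub("\u2019", "'", word)) %>%
  # by = "word" is the only shared column; naming it silences the join message
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE) %>%
  ungroup()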

Results in: (plot image not included)
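
As a quick check that does not require downloading the books, the same fix can be tried on a toy table. This is only a sketch: the toy data frame and the names before and after are made up for illustration, and it assumes dplyr and tidytext are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy tokens: one curly-quote stop word, one plain stop word, one content word
toy <- tibble(word = c("it\u2019s", "it's", "rabbit"))

# Without normalization, the curly-quote "it’s" survives the anti_join
before <- toy %>%
  anti_join(stop_words, by = "word")

# With normalization, both spellings of "it's" are removed
after <- toy %>%
  mutate(word = gsub("\u2019", "'", word, fixed = TRUE)) %>%
  anti_join(stop_words, by = "word")

before$word
after$word
```

Here before should still contain the curly-quote token alongside "rabbit", while after should contain only "rabbit".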

Jake Kaupp