Calculate `tf-idf` for a data frame of documents

Question

The following code

library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words

taken from Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles, computes tf-idf for Jane Austen's works. However, this code appears to be specific to Jane Austen's books. Instead, I would like to compute tf-idf for the following data frame:

sentences<-c("The color blue neutralizes orange yellow reflections.", 
             "Zod stabbed me with blue Kryptonite.", 
             "Because blue is your favourite colour.",
             "Red is wrong, blue is right.",
             "You and I are going to yellowstone.",
             "Van Gogh looked for some yellow at sunset.",
             "You ruined my beautiful green dress.",
             "You do not agree.",
             "There's nothing wrong with green.")

df <- data.frame(text = sentences,
                 class = c("A","B","A","C","A","B","A","C","D"),
                 weight = c(1,1,3,4,1,2,3,4,5))

score 3 · Accepted Answer · edited Mar 25 '20 at 20:01

You need to change two things:

1. Since you did not set stringsAsFactors = FALSE when constructing the data.frame, text may be a factor (this was the default before R 4.0.0), so convert it to character first.
2. You do not have a column named book, so you have to pick another column as the document. Since your example includes a column named class, I assume you want to calculate tf-idf over that column.

Here is the code:

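library(dplyr)
library(tidytext)

# Tokenize each sentence into words and count them per class
book_words <- df %>%
  mutate(text = as.character(text)) %>%  # factor -> character before tokenizing
  unnest_tokens(output = word, input = text) %>%
  count(class, word, sort = TRUE)

# Treat each class as one document and add the tf, idf, and tf_idf columns
book_words <- book_words %>%
  bind_tf_idf(term = word, document = class, n = n)
book_words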
#> # A tibble: 52 x 6
#>    class word          n     tf   idf tf_idf
#>    <fct> <chr>     <int>  <dbl> <dbl>  <dbl>
#>  1 A     blue          2 0.0769 0.288 0.0221
#>  2 A     you           2 0.0769 0.693 0.0533
#>  3 C     is            2 0.2    0.693 0.139 
#>  4 A     and           1 0.0385 1.39  0.0533
#>  5 A     are           1 0.0385 1.39  0.0533
#>  6 A     beautiful     1 0.0385 1.39  0.0533
#>  7 A     because       1 0.0385 1.39  0.0533
#>  8 A     color         1 0.0385 1.39  0.0533
#>  9 A     colour        1 0.0385 1.39  0.0533
#> 10 A     dress         1 0.0385 1.39  0.0533
#> # ... with 42 more rows

The documentation has helpful remarks on this; check out ?count and ?bind_tf_idf.
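
To make the columns in the output less opaque, here is a minimal sketch that reproduces the calculation by hand with dplyr on a small, made-up count table. The `counts` data and the names `n_docs` and `manual` are illustrative assumptions, not part of the code above. The sketch assumes tf is a term's count divided by the total count in its document, and idf is the natural log of the number of documents divided by the number of documents containing the term. That is consistent with the output shown earlier: blue in class A has tf = 2/26 ≈ 0.0769 and, appearing in 3 of the 4 classes, idf = ln(4/3) ≈ 0.288.

```r
library(dplyr)

# Hypothetical toy counts: one row per (document, term) pair
counts <- data.frame(
  class = c("A", "A", "B", "C"),
  word  = c("blue", "you", "blue", "is"),
  n     = c(2L, 2L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Number of distinct documents (here: classes)
n_docs <- n_distinct(counts$class)

manual <- counts %>%
  group_by(class) %>%
  mutate(tf = n / sum(n)) %>%                        # share of the document's tokens
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(class))) %>%  # natural log, as assumed above
  ungroup() %>%
  mutate(tf_idf = tf * idf)

manual
```

If you run bind_tf_idf(word, class, n) on the same `counts` table, the tf, idf, and tf_idf columns should match `manual`, which is a quick way to check that the document column you picked behaves as you expect.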
