
I have texts that belong to different classes. My goal is to keep only the features (words) with the highest tf_idf values (top 20%) within each class.

As an example, I use the book_of_mormon data set, where text holds the text and book_title is the class.

One idea is to use the tidytext package and filter the top 20% per class.

library(scriptuRs)
library(tidytext)
library(tidyverse)

First, I create the tf_idf values:

d = book_of_mormon %>%
  select(book_title, text) %>%
  unnest_tokens(word, text) %>%
  group_by(book_title) %>%
  count(word) %>%
  bind_tf_idf(word, book_title, n) 

head(d, 3)

# A tibble: 3 x 6
# Groups:   book_title [1]
  book_title word          n        tf   idf    tf_idf
  <chr>      <chr>     <int>     <dbl> <dbl>     <dbl>
1 1 Nephi    a           200 0.00795   0     0        
2 1 Nephi    abhorreth     1 0.0000398 2.01  0.0000801
3 1 Nephi    abide         1 0.0000398 0.916 0.0000364
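To make the columns in this output easier to read, here is a minimal base R sketch of the same kind of calculation on made-up counts. It assumes the usual definitions (tf is the word's share of the class's tokens, idf is the natural log of the number of classes divided by the number of classes containing the word); the data frame `counts` and its values are hypothetical, not taken from book_of_mormon.

```r
# Hypothetical word counts per class (not from book_of_mormon)
counts <- data.frame(
  book = c("A", "A", "A", "B", "B", "C"),
  word = c("the", "sword", "abide", "the", "abide", "the"),
  n    = c(6, 3, 1, 8, 2, 5),
  stringsAsFactors = FALSE
)

n_books <- length(unique(counts$book))

# tf: share of the book's tokens taken up by this word
counts$tf <- counts$n / ave(counts$n, counts$book, FUN = sum)

# idf: log(number of books / number of books that contain the word).
# Each (book, word) pair occurs once, so counting rows per word works.
books_with_word <- ave(rep(1, nrow(counts)), counts$word, FUN = sum)
counts$idf <- log(n_books / books_with_word)

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

A word that occurs in every class, like "the" here or "a" in the output above, gets idf 0 and therefore tf_idf 0, however frequent it is.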

Then, filter the top 20% of the tf_idf values per class.

d = d %>%
  group_by(book_title) %>% 
  arrange(book_title, -tf_idf) %>%
  filter(tf_idf > quantile(tf_idf, .8))
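As a rough illustration of what a per-class quantile filter does, the following base R sketch applies the same rule (keep values strictly above the class's 80th percentile) to made-up scores. The data frame `scores` is hypothetical, and `ave()` stands in for the grouped filter.

```r
# Hypothetical tf_idf scores for two classes
scores <- data.frame(
  book   = rep(c("A", "B"), each = 5),
  word   = paste0("w", 1:10),
  tf_idf = c(0.1, 0.2, 0.3, 0.4, 0.5, 0, 0, 0, 0, 0.9),
  stringsAsFactors = FALSE
)

# 80th percentile of tf_idf within each class, repeated for every row
cutoff <- ave(scores$tf_idf, scores$book,
              FUN = function(x) quantile(x, 0.8, names = FALSE))

# Strict ">" as in the filter above: values equal to the cutoff are dropped
top <- scores[scores$tf_idf > cutoff, ]
print(top)
```

Because the comparison is strict and the cutoff is interpolated, the share of words kept per class can be below 20% when many scores tie, for example when most words have tf_idf 0.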

Finally, I cast the data frame into a document-term matrix (dtm), so that the books are the observations (rows) and the features are the columns.

d = d %>%
  cast_dtm(book_title, word, tf_idf)  # arguments are document, term, value

d = as.data.frame(as.matrix(d))

However, when I cast the data frame back to a matrix, which is necessary for my task, the number of rows decreases (i.e. some documents/observations are dropped).

dim(d)
[1] 19099     6


dim(book_of_mormon)
[1] 6604   19
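One possible reading of the row loss, sketched in base R with made-up data: once low-scoring rows are filtered out, a document whose words all fall below the cutoff has no rows left, so it never appears in the cast matrix. Note also that the counts above are computed per book_title, so a matrix cast with book_title as the document has at most one row per book, not one per row of book_of_mormon. The names `long`, `kept` and the 0.1 cutoff below are hypothetical, and `xtabs()` stands in for the cast step.

```r
# Hypothetical long-format table: one row per (document, word)
long <- data.frame(
  doc    = c("v1", "v1", "v2", "v3", "v3"),
  word   = c("sword", "the", "the", "abide", "the"),
  tf_idf = c(0.9, 0.0, 0.0, 0.4, 0.0),
  stringsAsFactors = FALSE
)
all_docs <- unique(long$doc)

# Stand-in for the quantile filter: keep only high-scoring rows
kept <- long[long$tf_idf > 0.1, ]

# Casting the filtered rows: v2 has no rows left, so it disappears
m_dropped <- xtabs(tf_idf ~ doc + word, data = kept)

# Declaring all documents as factor levels keeps v2 as an all-zero row
kept$doc <- factor(kept$doc, levels = all_docs)
m_full <- xtabs(tf_idf ~ doc + word, data = kept)

print(dim(m_dropped))
print(dim(m_full))
```

If every original observation has to stay in the matrix, the second variant keeps documents without any surviving feature as rows of zeros.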

Another idea is to use the tm package. However, with large data sets (like my original one), R runs out of memory.

First, I create the dtm and a data frame.

library(tm)

corpus = Corpus(VectorSource(book_of_mormon$text))

corpus = corpus %>%
  tm_map(content_transformer(tolower)) %>%  # lowercase first so capitalized stop words are matched
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation)

dtm = DocumentTermMatrix(corpus)

dtm = weightTfIdf(dtm, normalize = TRUE)

dtm =  as.data.frame(as.matrix(dtm))

dtm$book_title = book_of_mormon$book_title

Then, I filter the features with the highest values per class.

dict = dtm %>%
  gather(Variable, Value, -book_title) %>%
  group_by(book_title) %>%
  arrange(book_title, -Value) %>%
  top_n(5, Value) # I use top_n to keep the data small (i.e. it's
                  # computationally expensive to filter out the top
                  # 20%, which would lead to a long runtime in R in this
                  # example)

Finally, I create a filtered dtm with the top 20% (top 5) features per class.

dtm2 = DocumentTermMatrix(corpus, control = list(dictionary = unique(dict$Variable)))

dtm2 = weightTfIdf(dtm2, normalize = TRUE)

dtm2 =  as.data.frame(as.matrix(dtm2))
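Since the memory problem comes from turning the whole dtm into a dense matrix, one option is to select the features while the data is still sparse. The sketch below is only an illustration on a made-up matrix using the Matrix package (an assumption, it is not used in the code above); `m`, `class_of` and the "top 1 per class" rule are hypothetical. A tm DocumentTermMatrix stores its non-zero cells in `$i`, `$j` and `$v`, so a similar triplet table could presumably be built from those fields.

```r
library(Matrix)

# Hypothetical sparse document-term matrix with tf-idf weights
m <- sparseMatrix(
  i = c(1, 1, 2, 3, 3, 4),
  j = c(1, 2, 2, 3, 1, 4),
  x = c(0.5, 0.1, 0.2, 0.9, 0.3, 0.7),
  dims = c(4, 4),
  dimnames = list(paste0("d", 1:4), c("sword", "the", "abide", "plates"))
)
class_of <- c("A", "A", "B", "B")  # class label of each document (row)

# Triplet form: one row per non-zero cell, so zeros are never materialized
trip <- as.data.frame(summary(m))
trip$class <- class_of[trip$i]

# Highest score each term reaches within each class
best <- aggregate(x ~ class + j, data = trip, FUN = max)
best <- best[order(best$class, -best$x), ]

# Keep the top term per class (use a larger n for more features)
top <- do.call(rbind, lapply(split(best, best$class), head, n = 1))

# Column subsetting keeps the matrix sparse
m_small <- m[, sort(unique(top$j)), drop = FALSE]
print(m_small)
```

Only the reduced matrix `m_small` would then need to be converted with as.matrix(), which is much smaller than the full dtm.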
Banjo
  • `‘tidy_text’ is not available (for R version 3.6.0)`. Is this deprecated? – akrun Jun 18 '19 at 20:35
  • If I understand you correctly here, this is what will happen when you filter to only the top X% of tf-idf. If you only keep words that are in the top tf-idf, then you are throwing out all the other words. You'll only keep the documents that contain the high tf-idf words and you won't have the documents that contain lower tf-idf words only anymore. – Julia Silge Jul 10 '19 at 16:37
  • Yes, that's right. I want to keep only those terms (here the top 20%) that separate the classes well. It's like a data set with a large number of numerical variables, where I only want to keep those variables that are highly correlated with each class. – Banjo Jul 11 '19 at 11:27
