I am using the R programming language. I am trying to replicate this previous Stack Overflow post: (R) About stopwords in DocumentTermMatrix, for the purpose of "tokenizing" and removing "stop words".

Using some publicly available Shakespeare Plays, I created a "term document matrix" using 3 plays:

#load libraries
library(dplyr)
library(pdftools)
library(tidytext)
library(textrank)
library(tm)

#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_1 <- article_words %>%
  anti_join(stop_words, by = "word")

#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_2<- article_words %>%
  anti_join(stop_words, by = "word")


#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_3 <- article_words %>%
  anti_join(stop_words, by = "word")

From here, I create the actual "term document matrix":

library(tm)

#create term document matrix
tdm <- TermDocumentMatrix(Corpus(VectorSource(rbind(article_words_1, article_words_2, article_words_3))))

#inspect the "term document matrix" (I don't know why this is producing an error)
inspect(tdm)
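(If `inspect()` misbehaves, tm has a couple of other helpers for examining the matrix; this is just a sketch of calls from the tm API:)

```r
# Basic checks on the term document matrix
dim(tdm)                          # number of terms x number of documents
findFreqTerms(tdm, lowfreq = 50)  # terms that appear at least 50 times
```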

After this, I am trying to perform "tokenization" and remove "stop words" using two different methods (source: (R) About stopwords in DocumentTermMatrix):

library(quanteda)

#first method:

first_method <- tokens(tdm) %>%
  tokens_remove(stopwords("en"), pad = TRUE)

Error: tokens() only works on character, corpus, list, tokens objects.


#second method:

second_method <- dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))

Error: dfm() only works on character, corpus, list, tokens objects.

Both of these steps result in errors indicating that these functions only work on "character, corpus, list, or tokens" objects. Is there some way to use these functions on the term document matrix I created?
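One route I have seen suggested, but have not verified on the data above, is to "tidy" the term document matrix back into a data frame with tidytext and then cast it to a quanteda `dfm`, since `dfm_remove()` does accept a `dfm`:

```r
library(tidytext)
library(quanteda)

# Sketch (untested on the matrix above): convert the tm TermDocumentMatrix
# to a tidy data frame, cast it to a quanteda dfm, then drop stop words
tdm_tidy <- tidy(tdm)  # columns: document, term, count
my_dfm <- tdm_tidy %>%
  cast_dfm(document, term, count) %>%
  dfm_remove(stopwords("en"))
```

Note that this would only handle stop word removal; re-tokenizing a term document matrix does not seem possible, since the original word order is already lost.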

Thanks

stats_noob
    What is your goal with all of this? Because article_words_1 is already tokenized and stopwords are removed. tidytext has functions to transform the data.frames into a dtm, tdm or a dfm with `cast_xx` commands. – phiver May 04 '21 at 10:25
  • @phiver: thank you for your reply! This is just an example i thought of with Shakespeare plays. My real data is not tokenized and has stop words. My real data is already a term document matrix. Is there a way to tokenize and remove stop words from a term document matrix? Thank you – stats_noob May 04 '21 at 16:11
  • does anyone know how to convert this to a corpus object? – stats_noob May 05 '21 at 03:08
  • You can't transform a tdm into a corpus as it already has been split and the features have been counted. But I see what you want to achieve. I will answer with an example on the new question you opened. Please close / delete this one. – phiver May 05 '21 at 10:24
