I am using the R programming language. I am trying to replicate an earlier Stack Overflow post, "(R) About stopwords in DocumentTermMatrix", for the purpose of tokenizing and removing stop words.
Using some publicly available Shakespeare plays, I created a term-document matrix from 3 plays:
#load libraries
library(dplyr)
library(pdftools)
library(tidytext)
library(textrank)
library(tm)
# 1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)
article_words <- article_sentences %>%
  unnest_tokens(word, sentence)
article_words_1 <- article_words %>%
  anti_join(stop_words, by = "word")

# 2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)
article_words <- article_sentences %>%
  unnest_tokens(word, sentence)
article_words_2 <- article_words %>%
  anti_join(stop_words, by = "word")

# 3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)
article_words <- article_sentences %>%
  unnest_tokens(word, sentence)
article_words_3 <- article_words %>%
  anti_join(stop_words, by = "word")
From here, I create the actual "term document matrix":
# create term-document matrix
tdm <- TermDocumentMatrix(Corpus(VectorSource(rbind(article_words_1, article_words_2, article_words_3))))
# inspect the term-document matrix (I don't know why this step produces an error)
inspect(tdm)
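(I suspect, though I am not certain, that the problem is that `VectorSource()` expects a character vector with one string per document, while `rbind(...)` produces a data frame, so its columns get coerced in an unintended way. A sketch of what I think the intended construction would look like, collapsing each play's words back into a single string so that each play is one document:)

```r
# Sketch (assumption on my part): one string per play, so VectorSource()
# sees exactly three documents
docs <- c(
  hamlet  = paste(article_words_1$word, collapse = " "),
  macbeth = paste(article_words_2$word, collapse = " "),
  othello = paste(article_words_3$word, collapse = " ")
)
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))
inspect(tdm)
```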
After this, I am trying to perform tokenization and remove stop words using two different methods (source: "(R) About stopwords in DocumentTermMatrix"):
library(quanteda)
# first method:
first_method <- tokens(tdm) %>%
  tokens_remove(stopwords("en"), pad = TRUE)

Error: tokens() only works on character, corpus, list, tokens objects.
# second method:
second_method <- dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))

Error: dfm() only works on character, corpus, list, tokens objects.
Both of these steps result in errors indicating that these functions only work on "character, corpus, list, or tokens" objects. Is there some way to use these functions on the term-document matrix I created?
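(One idea I had, which I have not verified: since `as.matrix()` works on a tm TermDocumentMatrix, maybe the matrix can be handed to quanteda via `as.dfm()`, transposing first so documents become rows. A sketch:)

```r
# Untested sketch: convert the tm TermDocumentMatrix to a quanteda dfm
# via a plain matrix (quanteda expects documents in rows, hence t()),
# then remove stop words at the dfm level
library(quanteda)
mat <- t(as.matrix(tdm))
my_dfm <- as.dfm(mat)
my_dfm <- dfm_remove(my_dfm, stopwords("en"))
```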
Thanks