I have a question about text preprocessing with the quanteda package in R. I want to generate a document-feature matrix from a set of documents, so I created a corpus and ran the following code:
data <- read.csv2("abstract.csv", stringsAsFactors = FALSE)
corpus <- corpus(data, docid_field = "docname", text_field = "documents")
dfm <- dfm(corpus, stem = TRUE, remove = stopwords('english'),
           remove_punct = TRUE, remove_numbers = TRUE,
           remove_symbols = TRUE, remove_hyphens = TRUE)
When I examined the dfm, I noticed some tokens I did not expect (#ml, @attribut, _iq, 0.01ms). I would rather have ml, attribut, iq, and ms.
I thought these options would remove all symbols and numbers. Why do they still show up in the tokens?
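In case it helps, here is a minimal sketch that should let others look at the behaviour without my CSV file. The sample sentence and the object names `txt` and `toks` are made up for illustration, and I call `tokens()` directly on the assumption that `dfm()` passes these removal options through to it.

```r
library(quanteda)

# Made-up text containing the kinds of tokens I see in my dfm
txt <- c(doc1 = "We use #ml and @attributes with _iq at 0.01ms latency.")

# Same removal options as in my dfm() call, applied at the tokenization step
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)

# Inspect which tokens survive the removal options
print(toks)
```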
I'd be glad to get some help.
Thanks!!!