
I have a question about language pre-processing in quanteda (R). I want to generate a document-feature matrix from some documents, so I created a corpus and ran the following code.

data <- read.csv2("abstract.csv", stringsAsFactors = FALSE)
corpus<-corpus(data, docid_field = "docname", text_field = "documents")
dfm <- dfm(corpus, stem = TRUE, remove = stopwords('english'),
           remove_punct = TRUE, remove_numbers = TRUE, 
           remove_symbols = TRUE, remove_hyphens = TRUE)

When I examined the dfm, I noticed some tokens such as #ml, @attribut, _iq, and 0.01ms. I would rather have ml, attribut, iq, and ms.

I thought I had removed all the symbols and numbers. Why do I still get them?

I'd be glad to get some help.

Thanks!!!

Andrew Gustar
Hu_Ca
  • If you check the help for `tokens` it says that, e.g. `remove_numbers` will remove tokens (words) that consist only of numbers, but not numbers that appear alongside other characters. You might be better off taking these numbers and other characters out of your data using something like the `stringr` package if that is what you need. – Andrew Gustar Jun 03 '19 at 17:51
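
The distinction drawn in the comment above can be illustrated with a rough base R sketch. It is not quanteda's tokenizer: splitting on spaces and the regular expression for "a token made only of numbers" are both assumptions chosen for illustration, and the sample string is made up.

```r
# Rough illustration only: splitting on spaces stands in for quanteda's tokenizer.
toks <- strsplit("version 2019 0.01ms 42", " ", fixed = TRUE)[[1]]

# TRUE only where the whole token is digits, optionally with . or , separators
is_number <- grepl("^[0-9]+([.,][0-9]+)*$", toks)

# Tokens that merely contain digits, such as "0.01ms", are kept
kept <- toks[!is_number]
print(kept)
```

Under these assumptions, "2019" and "42" are dropped while "0.01ms" survives, which mirrors why mixed tokens remain in the dfm.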

1 Answer


For really fine control you will want to process the text yourself through pattern replacement. Using stringi (or stringr) you can easily replace characters by Unicode category, such as symbols or punctuation.

Consider this example.

txt <- "one two, #ml @attribut _iq, 0.01ms."

quanteda::tokens(txt, remove_twitter = TRUE, remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "one"      "two"      "ml"       "attribut" "_iq"      "0.01ms"

That's an easy way to remove the special characters that might indicate "Twitter" or other social media conventions.

For more low-level control:

# how to remove the leading _ (just to demonstrate)
stringi::stri_replace_all_regex(txt, "(\\b)_(\\w+)", "$1$2")
## [1] "one two, #ml @attribut iq, 0.01ms."

# remove all digits
(txt <- stringi::stri_replace_all_regex(txt, "\\d", ""))
## [1] "one two, #ml @attribut _iq, .ms."
# remove all punctuation and symbols
(txt <- stringi::stri_replace_all_regex(txt, "[\\p{p}\\p{S}]", ""))
## [1] "one two ml attribut iq ms"

quanteda::tokens(txt)
## tokens from 1 document.
## text1 :
## [1] "one"      "two"      "ml"       "attribut" "iq"       "ms"

That is what you are aiming for, I am (partly) guessing.
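
If installing stringi is not an option, the same two replacement steps can be sketched in base R, since `gsub()` with `perl = TRUE` also understands the `\p{P}` and `\p{S}` Unicode categories. This swaps in base R for stringi and is only a minimal sketch: `clean_text()` is a hypothetical helper name, and the whitespace collapsing at the end is an extra step not taken in the example above.

```r
# clean_text() is a hypothetical helper, not part of quanteda or stringi.
clean_text <- function(x) {
  x <- gsub("\\d", "", x, perl = TRUE)             # remove all digits
  x <- gsub("[\\p{P}\\p{S}]", "", x, perl = TRUE)  # remove punctuation and symbols
  x <- gsub("\\s+", " ", x, perl = TRUE)           # collapse runs of whitespace (extra step)
  trimws(x)
}

txt <- "one two, #ml @attribut _iq, 0.01ms."
cleaned <- clean_text(txt)
print(cleaned)
```

Because `gsub()` is vectorised, a helper like this could be applied to the whole text column (for instance `data$documents` from the question) before building the corpus.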

Ken Benoit
  • Is it possible to do exactly what you suggest, but on an object of class `tokens`? For instance, if I do `quanteda::tokens_remove(tokens(text), "[[:digit:]]")`, any token that contains a number is removed altogether. Is there a workaround that avoids looping over a character vector and works directly on **quanteda** objects? – Francesco Grossetti Feb 12 '21 at 12:17
  • You should ask this as a new question with more details, but in short, you can remove numbers from tokens using `tokens(x, remove_numbers = TRUE)`. – Ken Benoit Feb 13 '21 at 09:49