EDIT: See the EDIT at the bottom of this post.
I'm trying to convert a corpus object to tokens using R and quanteda. Using the options in tokens(), I cannot seem to remove the underscores in some words/characters. When I try stringi::stri_replace_all_regex(), the affected tokens disappear entirely.
The following code gives the output below:
dirty_corpus <- corpus(textdata)
toks <- dirty_corpus %>%
  stringi::stri_replace_all_regex("\'[a-z]*", "") %>%
  tokens(what = "word", remove_punct = TRUE, preserve_tags = FALSE,
         remove_numbers = TRUE, remove_separators = TRUE, remove_url = TRUE,
         split_hyphens = TRUE, remove_symbols = TRUE, split_tags = TRUE,
         verbose = TRUE) %>%
  tokens_remove(pattern = phrase(english_stopwords), valuetype = "fixed") %>%
  tokens_wordstem() %>%
  tokens_tolower()
Output:
text6 : [1] "ys" "s_" "_t" "_s" "sw" "lnk" "smn" "pstd" "dwn" "blw" [11] "srri"
I want the following output:
text6 : [1] "ys" "s" "t" "s" "sw" "lnk" "smn" "pstd" "dwn" "blw" [11] "srri"
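In isolation, stripping the underscores from the tokens is straightforward. A minimal base-R sketch with gsub() standing in for stri_replace_all_regex(), using a made-up token vector that mirrors the output above:

```r
# Hypothetical token vector mirroring the problematic output
toks <- c("ys", "s_", "_t", "_s", "sw")

# Remove literal underscores from every token
# (fixed = TRUE treats "_" as a plain string, not a regex)
cleaned <- gsub("_", "", toks, fixed = TRUE)
# cleaned is now: "ys" "s" "t" "s" "sw"
```

So the replacement itself behaves as expected; the question is why the tokens then vanish from the pipeline.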
When I add the following step to the chain:

stringi::stri_replace_all_regex("_", "") %>%

resulting in this code:
dirty_corpus <- corpus(textdata)
toks <- dirty_corpus %>%
  stringi::stri_replace_all_regex("\'[a-z]*", "") %>%
  stringi::stri_replace_all_regex("_", "") %>%
  tokens(what = "word", remove_punct = TRUE, preserve_tags = FALSE,
         remove_numbers = TRUE, remove_separators = TRUE, remove_url = TRUE,
         split_hyphens = TRUE, remove_symbols = TRUE, split_tags = TRUE,
         verbose = TRUE) %>%
  tokens_remove(pattern = phrase(english_stopwords), valuetype = "fixed") %>%
  tokens_wordstem() %>%
  tokens_tolower()
The output becomes the following:
text6 : [1] "ys" "sw" "lnk" "smn" "pstd" "dwn" "blw" "srri"
The tokens that previously contained an underscore have disappeared entirely.
How can I obtain the result I intend?
EDIT: In hindsight, everything was performing exactly as intended! Since I didn't write all of the code myself, I hadn't realized that the single characters left over after removing the underscores were in the stopwords list, hence they were being removed by tokens_remove()! Stupid!
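What happened, in miniature: once the underscores are stripped, tokens like "s_" and "_t" collapse to single letters that match entries in the stopword list, so tokens_remove() drops them. A base-R sketch with assumed values (the real english_stopwords list comes from my own setup and is much larger):

```r
# Hypothetical tokens mirroring the ones in the question
toks <- c("ys", "s_", "_t", "_s", "sw")

# Step 1: strip the underscores, leaving bare single letters behind
cleaned <- gsub("_", "", toks, fixed = TRUE)

# Step 2: assumed subset of the stopword list for illustration;
# many English stopword lists include single letters such as "s" and "t"
english_stopwords <- c("s", "t")

# Step 3: stopword removal then drops exactly those leftover letters
kept <- cleaned[!cleaned %in% english_stopwords]
# kept is now: "ys" "sw"
```

So the pipeline was correct all along; the "missing" tokens were simply caught by the stopword filter after the underscores were removed.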