
EDIT: see the edit at the bottom of the question.

I'm trying to convert a corpus object to tokens using R and quanteda. Using the options in `tokens()` I cannot seem to remove the underscores in some words/characters, and when I try using `stri_replace_all_regex()` the characters disappear completely.

The following code gives the output below:

dirty_corpus <- corpus(textdata)

toks <- dirty_corpus %>%
  stringi::stri_replace_all_regex("\'[a-z]*", "") %>%
  tokens(what = "word", remove_punct = TRUE, preserve_tags = FALSE, remove_numbers = TRUE, remove_separators = TRUE,
         remove_url = TRUE, split_hyphens = TRUE, remove_symbols = TRUE, split_tags = TRUE, verbose = TRUE) %>%
  tokens_remove(pattern = phrase(english_stopwords), valuetype = 'fixed') %>%
  tokens_wordstem() %>%
  tokens_tolower()

Output:

text6 : [1] "ys" "s_" "_t" "_s" "sw" "lnk" "smn" "pstd" "dwn" "blw" [11] "srri"

I want the following output:

text6 : [1] "ys" "s" "t" "s" "sw" "lnk" "smn" "pstd" "dwn" "blw" [11] "srri"

When I chain:

stringi::stri_replace_all_regex("_", "") %>%

resulting in the following code:

dirty_corpus <- corpus(textdata)

toks <- dirty_corpus %>%
  stringi::stri_replace_all_regex("\'[a-z]*", "") %>%
  stringi::stri_replace_all_regex("_", "") %>%
  tokens(what = "word", remove_punct = TRUE, preserve_tags = FALSE, remove_numbers = TRUE, remove_separators = TRUE,
         remove_url = TRUE, split_hyphens = TRUE, remove_symbols = TRUE, split_tags = TRUE, verbose = TRUE) %>%
  tokens_remove(pattern = phrase(english_stopwords), valuetype = 'fixed') %>%
  tokens_wordstem() %>%
  tokens_tolower()

The output becomes the following:

text6 : [1] "ys" "sw" "lnk" "smn" "pstd" "dwn" "blw" "srri"

This makes the tokens that previously contained an underscore disappear entirely.

How can I obtain the result I intend?

EDIT: In hindsight everything was performing exactly as intended! Since I didn't write all of the code myself, I did not realize that the characters being removed were in the stopwords list, hence they were being removed. Stupid mistake!
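For anyone debugging a similar case, a quick sanity check is to intersect the "disappearing" tokens with your stopword list before blaming the regex. This is a hypothetical reconstruction (the question never shows how `english_stopwords` was built); the NLTK-derived list in the `stopwords` package, for example, contains bare "s" and "t" left over from splitting contractions:

```r
library(stopwords)

# Hypothetical stand-in for the english_stopwords object from the question;
# substitute your own list here.
english_stopwords <- stopwords::stopwords("en", source = "nltk")

# Which of the tokens that vanished are actually stopwords?
intersect(c("s", "t"), english_stopwords)
```

If the intersection is non-empty, `tokens_remove()` is doing exactly what it was asked to do.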

DartLazer
  • What is the output if you just run the first 3 lines? I.e. `dirty_corpus %>% stringi::stri_replace_all_regex("\'[a-z]*", "") %>% stringi::stri_replace_all_regex("_", "")` My guess is that the 't's and 's's are getting removed by the next 3 lines. – Nicolás Velasquez Dec 16 '22 at 18:12
  • Hey @NicolásVelásquez thx for your answer. Well I think it's not the case because if I run the first example after that code they are still in there. After the regex calls nothing is changed in the other samples! – DartLazer Dec 16 '22 at 18:56
  • Hello @DartLazer, to better help you, could you please let us know which packages you are using? Also, a sample of the dirty_corpus might be helpful. – Nicolás Velasquez Dec 16 '22 at 19:12
  • @NicolásVelásquez Thank you once again for your comment! After debugging I came to the conclusion that your first guess was right! These characters were in my stopwords list, hence they were being removed. My apologies! – DartLazer Dec 19 '22 at 13:39

2 Answers


You can use `types()` and `tokens_replace()` to modify your tokens with stringi functions without leaving the tokens object.

require(quanteda)
txt <- "_t ys s_ _t _s sw lnk _s"
toks <- tokens(txt)
print(toks)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "_t"  "ys"  "s_"  "_t"  "_s"  "sw"  "lnk" "_s"
toks_rep <- tokens_replace(toks, 
                           pattern = types(toks), 
                           replacement = stringi::stri_replace_all_fixed(types(toks), "_", ""),
                           valuetype = "fixed")
print(toks_rep)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "t"   "ys"  "s"   "t"   "s"   "sw"  "lnk" "s"

You can also use tokens_split(toks, "_") if you have the underscores in-between like "y_s".
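A minimal sketch of that alternative, assuming quanteda is loaded as above: `tokens_split()` breaks a token apart at the separator instead of stripping the character, which is what you want when the underscore sits between two meaningful fragments.

```r
require(quanteda)

# Token with an internal underscore, plus normal tokens for comparison
toks2 <- tokens("y_s sw lnk")

# Split on "_" and drop the separator itself
tokens_split(toks2, separator = "_", remove_separator = TRUE)
```

Note the difference from the `tokens_replace()` approach: splitting turns "y_s" into two tokens ("y" and "s") rather than merging it into one ("ys").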

Kohei Watanabe
  • Thank you very much! In the end the problem was of my own doing! I accidentally removed these characters using the remove stopwords function. They were included in the stopwords list! – DartLazer Dec 19 '22 at 13:40

In hindsight everything was performing exactly as intended! Since I didn't write all of the code myself, I did not realize that the characters being removed were in the stopwords list, hence they were being removed. Stupid mistake!

DartLazer