I searched through the existing questions and was able to remove • with my first set of commands. But when I apply the same approach to my corpus, it doesn't work: the • still appears. The corpus has 6,570 elements (2.3 MB), so it seems to be valid.

> x <- ". R Tutorial"
> gsub("•","",x)
[1] ". R Tutorial"

> removeSpecialChars <- function(x) gsub("•","",x)
> corpus2=tm_map(corpus2, removeSpecialChars)
> print(corpus2[[6299]][1])
[1] "• R tutorial • success– october"
> ##remove special characters
Ken Benoit
Etalo
  • Your first call to `gsub()` didn't illustrate the point, because `•` was missing from `x`. But that aside, I tested it and it worked. I don't know what your actual problem is. – Tim Biegeleisen Mar 22 '17 at 03:39
  • My problem is that the `gsub()` call works on its own in the first example. But when I apply it in my code in the second call, with `tm_map` and the corpus, it can't remove the • – Etalo Mar 22 '17 at 03:55
  • Probably another encoding issue. – Hong Ooi Mar 22 '17 at 05:10
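
One way to check the encoding hypothesis from the comments is to look at the code points that are actually in the text and to match with a Unicode escape instead of a pasted literal. This is a minimal base R sketch, not a confirmed diagnosis: `doc` is a hypothetical stand-in for one corpus document, and it assumes the character is the standard bullet U+2022. The commented-out last line assumes a tm version that provides `content_transformer()`.

```r
# Hypothetical sample standing in for one document of the corpus
doc <- "\u2022 R tutorial \u2022 success\u2013 october"

# Inspect the first character's code point; a standard bullet is U+2022
utf8ToInt(substr(doc, 1, 1))

# Check the declared encoding, then normalize to UTF-8 before matching
Encoding(doc)
doc_utf8 <- enc2utf8(doc)

# Match via an escape so the pattern does not depend on the script's encoding
bullet <- "\u2022"
cleaned <- gsub(bullet, "", doc_utf8, fixed = TRUE)
cleaned

# With tm, the same function may need wrapping (assumption: tm >= 0.6):
# corpus2 <- tm_map(corpus2, content_transformer(function(x) gsub(bullet, "", x, fixed = TRUE)))
```

If `utf8ToInt()` reports a different code point, or the declared encoding is not UTF-8, the pattern and the text probably do not contain the same character, which would explain why the replacement silently does nothing.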

1 Answer

How about this for an alternative that works in a more straightforward way with corpus objects?

require(quanteda)
require(magrittr)

corpus3 <- corpus(c("• R Tutorial", "More of these • characters •", "Tricky •!"))

# remove the character from the tokenized corpus
tokens(corpus3)
## tokens from 3 documents.
## text1 :
## [1] "R"        "Tutorial"
## 
## text2 :
## [1] "More"       "of"         "these"      "characters"
## 
## text3 :
## [1] "Tricky" "!"  
tokens(corpus3) %>% tokens_remove("•")
## tokens from 3 documents.
## text1 :
## [1] "R"        "Tutorial"
## 
## text2 :
## [1] "More"       "of"         "these"      "characters"
## 
## text3 :
## [1] "Tricky" "!"  

# remove the character from the corpus itself
texts(corpus3) <- gsub("•", "", texts(corpus3), fixed = TRUE)
texts(corpus3)
##         text1                        text2                        text3 
## " R Tutorial" "More of these  characters "                   "Tricky !" 
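
The output above still carries the spaces that surrounded each removed bullet (a leading space in text1, a double space in text2). A possible follow-up in base R is sketched below; `txt` is a hypothetical named character vector standing in for the corpus texts, and collapsing whitespace is an assumption about what is wanted, not part of the original answer.

```r
# Hypothetical character vector standing in for the corpus texts
txt <- c(text1 = "\u2022 R Tutorial",
         text2 = "More of these \u2022 characters \u2022",
         text3 = "Tricky \u2022!")

# Remove the bullet as a literal string
no_bullet <- gsub("\u2022", "", txt, fixed = TRUE)

# Collapse runs of whitespace to one space, then trim both ends
tidy <- trimws(gsub("\\s+", " ", no_bullet))
tidy
```

Replacing the bullet with a single space instead of an empty string would also keep words from running together when a bullet sits directly between them.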
Ken Benoit