I'm using the twitteR package to download tweets from Twitter. The tweets are downloaded and stored in a MySQL database. I want to get rid of all "unknown characters". The problem is that gsub() converts my åäö characters to aao. Here I have extracted one row as an example:
> testing <- outputdata$text[396]
> stri_enc_mark(testing) # Gets declared encodings for each string
[1] "UTF-8"
> Encoding(testing) # Reads the declared encodings
[1] "UTF-8"
> all(stri_enc_isutf8(testing)) # check if every character is UTF-8
[1] TRUE
> testing <- gsub("[^0-9A-Za-z@#:åäöÅÄÖ///' ]", "", testing)
> testing
[1] "Mycket bra intervju med Sapo chefen Anders Tjornberg pa TV4 alldeles nyss "
Before running gsub(), the tweet looked like this:
"Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss ��"
If I try the following code, gsub() doesn't convert åäö to aao. The problem is that it works when I copy-paste the string, but not when it is loaded from the data frame:
> testing <- "Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss ��"
> stri_enc_mark(testing)
[1] "UTF-8"
> Encoding(testing)
[1] "UTF-8"
> all(stri_enc_isutf8(testing))
[1] TRUE
> testing <- gsub("[^0-9A-Za-z@#:åäöÅÄÖ///' ]", "", testing)
> testing
[1] "Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss "
I have tried using:
outputdata$text <- iconv(outputdata$text, to = "UTF-8", sub="")
outputdata$text <- iconv(outputdata$text, "UTF-8", "UTF-8",sub='')
on the whole data frame to delete all non-UTF-8 characters, but with no luck. I don't know if this is relevant:
Encoding(outputdata$text)
[1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
Maybe ten percent of the observations are marked "unknown".
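For the "unknown" rows, one thing that might be worth trying is base R's enc2utf8(), which converts strings from their declared (or native) encoding to UTF-8 and marks them accordingly, so the whole column carries a consistent declaration. A sketch, assuming the underlying bytes are representable in the native encoding:

# Re-declare/convert the whole column to UTF-8
outputdata$text <- enc2utf8(outputdata$text)
Encoding(outputdata$text)   # should now report "UTF-8" throughout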