
I'm using the twitteR package to download tweets from Twitter. The tweets are downloaded and stored in a MySQL database. I want to get rid of all "unknown characters". The problem is that gsub() converts my åäö characters to aao. Here I have extracted one row as an example:

> testing <- outputdata$text[396]
> stri_enc_mark(testing) # Gets declared encodings for each string
[1] "UTF-8"
> Encoding(testing) # Reads the declared encodings
[1] "UTF-8"
> all(stri_enc_isutf8(testing)) # check if every character is UTF-8
[1] TRUE
> testing <- gsub("[^0-9A-Za-z@#:åäöÅÄÖ///' ]", "", testing)
> testing
[1] "Mycket bra intervju med Sapo chefen Anders Tjornberg pa TV4 alldeles nyss  "

Before running gsub() the tweet looked like this:

"Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss  ��"

If I try the following code, gsub() doesn't convert åäö to aao. The problem is that it works when I copy-paste the text, but not when the text is loaded from the data frame.

> testing <- "Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss  ��"
> stri_enc_mark(testing)
[1] "UTF-8"
> Encoding(testing)
[1] "UTF-8"
> all(stri_enc_isutf8(testing))
[1] TRUE
> testing <- gsub("[^0-9A-Za-z@#:åäöÅÄÖ///' ]", "", testing)
> testing
[1] "Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss  "

I have tried using:

outputdata$text <- iconv(outputdata$text, to = "UTF-8", sub="")
outputdata$text <- iconv(outputdata$text, "UTF-8", "UTF-8",sub='')

on the whole data frame to delete all non-UTF-8 characters but with no luck. I don't know if this is relevant:

Encoding(outputdata$text)
[1] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"  

Maybe ten percent of the observations are unknown.
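For example, the rows flagged as "unknown" could be listed like this (a sketch on the same outputdata frame):

## Indices and share of rows whose declared encoding is "unknown"
unknown_rows <- which(Encoding(outputdata$text) == "unknown")
length(unknown_rows) / length(outputdata$text)
head(outputdata$text[unknown_rows])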

Bollen

2 Answers


Perhaps your title is confusing? Am I correct that you do NOT want to convert the characters with diacritical marks? When I used your text, gsub() seemed to work exactly as you wished: it preserved the characters with diacritical marks but removed, for example, �.

> testing <- "Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss  ��"
> testing2 <- gsub("[^0-9A-Za-z@#:åäöÅÄÖ///' ]", "", testing)
> testing2
[1] "Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss  "
> testing3 <- "RT @K_GBergstrom: Arbetsgivaravgifterna för unga sänks 1 maj, föreslår regeringen. Sen väntas de höjas (tredubblas?) kanske 1 juli. Politik…"
> testing3 <- "RT @K_GBergstrom: Arbetsgivaravgifterna för unga sänks 1 maj, föreslår regeringen. Sen väntas de höjas (tredubblas?) kanske 1 juli. Politik…"
> testing4 <- gsub("[^0-9A-Za-z@#:åäöÅÄÖ///' ]", "", test3)
> testing4
[1] "RT @KGBergstrom: Arbetsgivaravgifterna för unga sänks 1 maj föreslår regeringen Sen väntas de höjas tredubblas kanske 1 juli Politik"

As a small point, your tags might include regex; I doubt whether mysql is apposite.

lawyeR
  • The problem is that it works when you copy and paste the text, as you did and as I did in the example I added. But when I do the same thing on the data in the dataset, it doesn't work as expected. – Bollen Feb 17 '15 at 16:55
  • I knew it wasn't an answer, but your situation may not be reproducible. Can you scrape some of the tweets into a spreadsheet and read that into R directly from that file? Then see if the gsub works. Perhaps MySQL adds (or subtracts) some encoding? – lawyeR Feb 17 '15 at 19:03

This looks like an issue with Unicode Normalization Forms. See this answer for a likely explanation. When adapted to this situation, testing probably contains "ä" as "a" + "combining diaeresis above" and "å" as "a" + "combining ring above". The gsub() substitution strips the combining characters away, leaving only "a".
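A quick way to see the difference (a small sketch, not taken from the question) is to compare the code-point counts of the two normalization forms:

library(stringi)
## In NFC, "ä" is one code point; in NFD it is "a" plus a combining mark
stri_length(stri_trans_nfc("ä"))  # 1
stri_length(stri_trans_nfd("ä"))  # 2
utf8ToInt(stri_trans_nfd("ä"))    # 97 ("a") and 776 (combining diaeresis)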

As a remedy, you could try standardizing your text strings to the NFC form. For example:

library(stringi)
testing <- "Mycket bra intervju med Säpo chefen Anders Tjornberg på TV4 alldeles nyss  ��"
## This transformation is probably unnecessary
sub_pat <- stri_trans_nfc("[^0-9A-Za-z@#:åäöÅÄÖ///' ]")

testing_nfc <- stri_trans_nfc(testing)
## This should work
gsub(sub_pat, "", testing_nfc)

testing_nfd <- stri_trans_nfd(testing)
## This should convert ä and å to a
gsub(sub_pat, "", testing_nfd)

Another issue: The repeated slashes /// don't make much sense. Maybe the intention was to keep both slashes and backslashes, "[^0-9A-Za-z@#:åäöÅÄÖ/\\' ]".
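For what it's worth, the repetition itself is harmless; a single "/" in the class gives the same result (a quick check with a made-up string):

x <- "50/50 åäö!?"
## The three slashes just add "/" to the class three times; one is enough
gsub("[^0-9A-Za-z@#:åäöÅÄÖ///' ]", "", x)
gsub("[^0-9A-Za-z@#:åäöÅÄÖ/' ]", "", x)
## Both return "50/50 åäö"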

mvkorpel