Writing data isn't preserving encoding

Question

I have a string like the following:

str <- "ていただけるなら"
Encoding(str) #returns "UTF-8"

I write it to disk:

write.table(str, file="chartest", quote=F, col.names=F, row.names=F)

Now I open the file in Notepad++, which is set to "UTF-8 without BOM" encoding, and I get this:

<U+3066><U+3044><U+305F><U+3060><U+3051><U+308B><U+306A><U+3089>

What is going wrong in this process? I would like the written text file to display the string as it appears in R.

This is on Windows 7, R version 2.15

Try this: `writeLines(str, "chartest2.txt", useBytes=TRUE)` – Montgomery Clift Oct 01 '21 at 06:27
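A minimal sketch of what this comment suggests, assuming the string is already stored as UTF-8: `useBytes = TRUE` asks `writeLines` to write the string's bytes as they are rather than re-encoding them to the native locale. The file name comes from the comment; the `readBin` check at the end is only an illustration added here, not part of the original suggestion.

```r
# Build the string from Unicode escapes so this script itself stays ASCII.
str <- "\u3066\u3044\u305f\u3060\u3051\u308b\u306a\u3089"
str <- enc2utf8(str)  # make sure the stored bytes are UTF-8

# Write the bytes as-is, without translating to the native encoding.
writeLines(str, "chartest2.txt", useBytes = TRUE)

# Check: the file should start with exactly the UTF-8 bytes of the string.
# (The line ending that follows differs between platforms, so it is ignored.)
written  <- readBin("chartest2.txt", what = "raw",
                    n = file.info("chartest2.txt")$size)
expected <- charToRaw(str)
print(identical(head(written, length(expected)), expected))
```

If the comparison fails, the file most likely contains `<U+XXXX>` escapes or bytes in the native code page instead of UTF-8.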

score 16 · Answer 1 · edited May 23 '17 at 11:47

This is an annoying "feature" of R on Windows. The only solution I have found so far is to temporarily and programmatically switch your locale to the one appropriate for the script of the text in question. So, in the above case, you would use the Japanese locale.

## This won't work on Windows
str <- "ていただけるなら"
Encoding(str) #returns "UTF-8"
write.table(str, file="c:/chartest.txt", quote=F, col.names=F, row.names=F)
## The following should work on Windows - first grab and save your existing locale
print(Sys.getlocale(category = "LC_CTYPE"))
original_ctype <- Sys.getlocale(category = "LC_CTYPE")
## Switch to the appropriate locale for the script
Sys.setlocale("LC_CTYPE","japanese")
## Now you can write your text out and have it look as you would expect
write.table(str, "c:/chartest2.txt", quote = FALSE, col.names = FALSE, 
            row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
## ...and don't forget to switch back
Sys.setlocale("LC_CTYPE", original_ctype)

The above produces the two files you can see in this screenshot. The first file shows the Unicode code points, which is not what you want, while the second shows the glyphs you would normally expect.

[Screenshot: Japanese text]
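The save, switch and restore steps can also be wrapped in a small function so that the original locale comes back even if the write fails. This is only a sketch: `with_ctype` is a hypothetical helper name, not a base R function, and `"japanese"` is the Windows locale name used in the code above (other platforms use different names, so the call is wrapped in `tryCatch`).

```r
# Hypothetical helper (not part of base R): evaluate `code` under a temporary
# LC_CTYPE locale and restore the previous one afterwards, even on error.
with_ctype <- function(locale, code) {
  original <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", original), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) if the request cannot be honoured.
  if (identical(suppressWarnings(Sys.setlocale("LC_CTYPE", locale)), "")) {
    stop("locale not available on this system: ", locale)
  }
  force(code)  # `code` is evaluated lazily, i.e. only after the switch
}

str <- "\u3066\u3044\u305f\u3060\u3051\u308b\u306a\u3089"

# Report a message instead of failing where the locale name is not recognised.
tryCatch(
  with_ctype("japanese",
             write.table(str, "chartest2.txt", quote = FALSE,
                         col.names = FALSE, row.names = FALSE,
                         fileEncoding = "UTF-8")),
  error = function(e) message(conditionMessage(e))
)
```

Using `on.exit` means the "don't forget to switch back" step cannot be skipped by accident.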

So far nobody has been able to explain to me why this happens in R. It is not an unavoidable feature of Windows because Perl, as I mention in this post, gets round the issue somehow.

answered Jun 28 '12 at 15:16

SlowLearner

Thanks for your response. Japanese was just an example, I'd like something that works for all language types. Doesn't sound that promising though. – qua Jun 30 '12 at 14:50
@qua - Yes, I thought it might be a random example given that you broke the Japanese string at a rather odd place. Unfortunately at this point I don't think a better solution exists for R, but please do create your own answer to this question if you find one! I agree that if you do not know the script beforehand you might struggle with my approach because (amongst other problems) it would require you to guess at the script being used and there simply is no surefire way of establishing the encoding type. – SlowLearner Jun 30 '12 at 22:40
@SlowLearner Is there a way to list all the valid values for LC_CTYPE? – statsNoob Nov 11 '15 at 13:49
I do not know. I had a very specific usage aim so when I found what I was looking for I stopped searching. – SlowLearner Nov 11 '15 at 21:47
@SlowLearner Man, you are the best !! – Mohamed Kamal May 18 '17 at 08:17

score 2 · Answer 2 · answered Jun 27 '12 at 14:22

Have you tried using the `fileEncoding` argument?

write.table(str, file="chartest", quote=F, col.names=F, row.names=F, fileEncoding="UTF-8")
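Whether this works appears to depend on the platform, so it can help to check the output file rather than trust the call. The sketch below assumes the failure looks like the one in the question, i.e. `<U+XXXX>` escapes in the file; `has_unicode_escapes` is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: TRUE if any line of the file contains <U+XXXX> escapes,
# which is how the failed write shows up in the question.
has_unicode_escapes <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  any(grepl("<U\\+[0-9A-Fa-f]{4,8}>", lines, useBytes = TRUE))
}

str <- "\u3066\u3044\u305f\u3060\u3051\u308b\u306a\u3089"
write.table(str, file = "chartest", quote = FALSE, col.names = FALSE,
            row.names = FALSE, fileEncoding = "UTF-8")

# TRUE here means the text was escaped instead of being written as UTF-8.
print(has_unicode_escapes("chartest"))
```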

plannapus

Thanks for the suggestion. Trying that doesn't seem to work though. – qua Jun 27 '12 at 14:25
OK, so at the time I had tried it on my work computer (on Mac OS X) and it worked, but since then I tried it on my home computer (Windows 7) and indeed it didn't. – plannapus Jun 28 '12 at 06:12
