I'm trying to make Eclipse's console work with Polish (or any other non-English) characters, whether ANSI or UTF-8 encoded. It seems that, on Windows, R is only able to encode with ANSI, while Eclipse's console forces UTF-8 or ISO-8859-1.

Trying to use ANSI CP-1250 (default Polish encoding for Windows), I:

  • encode the R script-file as ANSI CP-1250
  • set Eclipse properties to cp1250, including the "R Script File" content type (in General -> Content Types -> Text), "Text file encoding" (in General -> Workspace), and the console encoding (in "Run Configurations" -> "R Console" -> "Common")
  • set JVM properties in eclipse.ini by adding the lines "-Dclient.encoding.override=cp1250" and "-Dfile.encoding=cp1250"

with no effect at all. How may I force Eclipse to encode and display in R's locale?

Exactly the same behavior persists when all these options are set to 'UTF-8' instead of 'CP-1250'. Please note that I'm unable to set R's locale to 'UTF-8' on Windows. It is worth mentioning that RStudio, Rgui and Rterm have no problems with the default CP-1250 encoding; strings are displayed correctly.

Executed script:

print(Sys.getlocale())
Sys.setenv(LANG = 'pl_PL.cp1250')

x <- 'ąęłóżść'
message('Printing variable'); print(x); print(charToRaw(x))

Output 1: 'Run via source' --> string encoded with ANSI CP1250, but printed as ISO-8859-1

> source("C:/mjktfw/pit/workspace/test_encoding/run3.R", echo=FALSE, encoding="cp1250")
[1] "LC_COLLATE=Polish_Poland.1250;LC_CTYPE=Polish_Poland.1250;LC_MONETARY=Polish_Poland.1250;LC_NUMERIC=C;LC_TIME=Polish_Poland.1250"
Printing variable
[1] "¹ê³ó
[1] b9 ea b3 f3 bf 9c e6

Output 2: 'Run via submitting directly' --> string encoded with UTF-8, printed correctly

> print(Sys.getlocale())
[1] "LC_COLLATE=Polish_Poland.1250;LC_CTYPE=Polish_Poland.1250;LC_MONETARY=Polish_Poland.1250;LC_NUMERIC=C;LC_TIME=Polish_Poland.1250"
> Sys.setenv(LANG = 'pl_PL.cp1250')
> 
> x <- 'ąęłóżść'
> message('Printing variable'); print(x); print(charToRaw(x))
Printing variable
[1] "ąęłóżść"
 [1] c4 85 c4 99 c5 82 c3 b3 c5 bc c5 9b c4 87

Output 3: copy-paste to console --> string encoded with UTF-8, printed correctly

> print(Sys.getlocale())
[1] "LC_COLLATE=Polish_Poland.1250;LC_CTYPE=Polish_Poland.1250;LC_MONETARY=Polish_Poland.1250;LC_NUMERIC=C;LC_TIME=Polish_Poland.1250"
> Sys.setenv(LANG = 'pl_PL.cp1250')
> 
> x <- 'ąęłóżść'
> message('Printing variable'); print(x); print(charToRaw(x))
Printing variable
[1] "ąęłóżść"
 [1] c4 85 c4 99 c5 82 c3 b3 c5 bc c5 9b c4 87

Output 4: 'Run entire command in R' shortcut --> string encoded as ANSI CP1250, but printed as plaintext Unicode codepoints

> print(Sys.getlocale())
[1] "LC_COLLATE=Polish_Poland.1250;LC_CTYPE=Polish_Poland.1250;LC_MONETARY=Polish_Poland.1250;LC_NUMERIC=C;LC_TIME=Polish_Poland.1250"
> Sys.setenv(LANG = 'pl_PL.cp1250')
> x <- 'ąęłóżść'
> message('Printing variable')
Printing variable
> print(x)
[1] "<U+00B9><ea><U+00B3><f3><U+00BF><U+009C><e6>"
> print(charToRaw(x))
[1] b9 ea b3 f3 bf 9c e6
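
The four outputs differ in the bytes that actually reach R, so the cases can be told apart by inspecting the raw bytes rather than the printed text. Below is a minimal sketch of such a check, assuming R >= 3.3.0 (for validUTF8()); the helper name describe_bytes is hypothetical and is not part of base R or StatET.

```r
# Hypothetical helper: report the raw bytes of a string, whether those
# bytes form a valid UTF-8 sequence, and how R has marked the string.
describe_bytes <- function(x) {
  list(
    raw        = paste(charToRaw(x), collapse = " "),
    valid_utf8 = validUTF8(x),
    marked_as  = Encoding(x)
  )
}

# Bytes seen in Outputs 1 and 4 (the CP-1250 form of the test string)
cp1250_bytes <- rawToChar(as.raw(c(0xb9, 0xea, 0xb3, 0xf3, 0xbf, 0x9c, 0xe6)))

# Bytes seen in Outputs 2 and 3 (the UTF-8 form of the test string)
utf8_bytes <- rawToChar(as.raw(c(0xc4, 0x85, 0xc4, 0x99, 0xc5, 0x82, 0xc3, 0xb3,
                                 0xc5, 0xbc, 0xc5, 0x9b, 0xc4, 0x87)))

str(describe_bytes(cp1250_bytes))
str(describe_bytes(utf8_bytes))
```

The CP-1250 byte sequence is not valid UTF-8, which is consistent with it being mangled whenever the console decodes it as anything other than CP-1250.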

Edit

After some more tinkering, it turns out that the cases above result from differences in the strings' raw bytes and in their declared Encoding(). The output below compares R/RStudio and Eclipse/StatET behavior:

Eclipse

> # UTF-8 encoded string
> char <- rawToChar(as.raw(c(0xea, 0xb3, 0x9c)))
> sprintf('String: %s', char); sprintf('Encoding: %s | Raw: %s', Encoding(char), paste(charToRaw(char), collapse = ' '))
[1] "String: 곜"
[1] "Encoding: unknown | Raw: ea b3 9c"
> 
> Encoding(char) <- 'UTF-8'
> sprintf('String: %s', char); sprintf('Encoding: %s | Raw: %s', Encoding(char), paste(charToRaw(char), collapse = ' '))
[1] "String: <U+ACDC>"
[1] "Encoding: UTF-8 | Raw: ea b3 9c"
> 
> # ANSI encoded string
> char <- rawToChar(as.raw(c(0xc4, 0x99, 0xc5, 0x82, 0xc5, 0x9b)))
> sprintf('String: %s', char); sprintf('Encoding: %s | Raw: %s', Encoding(char), paste(charToRaw(char), collapse = ' '))
[1] "String: ęłś"
[1] "Encoding: unknown | Raw: c4 99 c5 82 c5 9b"
> 
> Encoding(char) <- 'UTF-8'
> sprintf('String: %s', char); sprintf('Encoding: %s | Raw: %s', Encoding(char), paste(charToRaw(char), collapse = ' '))
[1] "String: 곜"
[1] "Encoding: UTF-8 | Raw: c4 99 c5 82 c5 9b"

Rstudio

> # UTF-8 encoded string
> char <- rawToChar(as.raw(c(0xea, 0xb3, 0x9c)))
> sprintf('String: %s', char); sprintf('Encoding: %s | Raw: %s', Encoding(char), paste(charToRaw(char), collapse = ' '))
[1] "String: ęłś"
[1] "Encoding: unknown | Raw: ea b3 9c"
> 
> Encoding(char) <- 'UTF-8'
> sprintf('String: %s', char); sprintf('Encoding: %s | Raw: %s', Encoding(char), paste(charToRaw(char), collapse = ' '))
[1] "String: 곜"
[1] "Encoding: UTF-8 | Raw: ea b3 9c"
> 
> # ANSI encoded string
> char <- rawToChar(as.raw(c(0xc4, 0x99, 0xc5, 0x82, 0xc5, 0x9b)))
> sprintf('String: %s', char); sprintf('Encoding: %s | Raw: %s', Encoding(char), paste(charToRaw(char), collapse = ' '))
[1] "String: ęłś"
[1] "Encoding: unknown | Raw: c4 99 c5 82 c5 9b"
> 
> Encoding(char) <- 'UTF-8'
> sprintf('String: %s', char); sprintf('Encoding: %s | Raw: %s', Encoding(char), paste(charToRaw(char), collapse = ' '))
[1] "String: ęłś"
[1] "Encoding: UTF-8 | Raw: c4 99 c5 82 c5 9b"
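
For reference, ea b3 9c are the CP-1250 bytes of "ęłś" and c4 99 c5 82 c5 9b are its UTF-8 bytes. Setting Encoding()<- only changes the declared mark and leaves the bytes alone, whereas iconv() re-encodes them. The sketch below shows the explicit conversion with base R; whether converting strings this way is enough to get correct output in the StatET console is an assumption and has not been verified here.

```r
# "ęłś" as CP-1250 bytes (ea b3 9c), taken from the examples above
ansi <- rawToChar(as.raw(c(0xea, 0xb3, 0x9c)))

# Re-encode the bytes from CP-1250 to UTF-8 (changes the bytes themselves,
# unlike Encoding()<-, which only changes the declared mark)
utf8 <- iconv(ansi, from = "CP1250", to = "UTF-8")
print(paste(charToRaw(utf8), collapse = " "))  # "c4 99 c5 82 c5 9b"

# Convert back to CP-1250 and confirm the round trip restores the bytes
back <- iconv(utf8, from = "UTF-8", to = "CP1250")
print(identical(charToRaw(back), charToRaw(ansi)))
```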
mjktfw
  • Working in Eclipse in French quite successfully; did you solve your problem? `Sys.getlocale() [1] "LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252"`. The encoding of R and Rnw files is set to Cp1252. I just have to add `\usepackage[latin1]{inputenc}` to my LaTeX preambles and be sure I wait to get the results in PDF. If I run some plots directly, then I have to convert my strings to UTF-8. – Cedric Dec 13 '17 at 13:16
  • Well, the issue is tied to how R communicates with the StatET console, so any actions, including sourcing files to generate data/plots/text output, will produce proper results as long as no console input/output is involved. I've reported this to the StatET creator, and he suggests this is an issue with JRI/rJava. – mjktfw Dec 13 '17 at 20:26

0 Answers