RStudio cannot read Chinese characters in a txt file properly

Question

While I was trying to read a txt file with read.table(), I ran into problems viewing the dataset in RStudio. The original txt file consists of three columns: ID, Content (Cantonese) and Time, in the following format:

100008251304976 你又知喎 2019-10-04 16:52:15
100027970365477 甘你買多幾包花生，小心熱氣 2019-10-04 16:23:43

I wrote this code to read it into RStudio:

x = read.table('comment.txt', encoding = 'utf-8', quote = "",fill = T,sep = '\t')

but the result is garbled data:

ç”˜ä½ è²·å¤šå¹¾åŒ…èŠ±ç”Ÿï¼Œå°å¿ƒç†±æ°£ 2019å¹´10æ

Then I checked my environment and locale as follows:

sessionInfo()
#R version 3.6.1 (2019-07-05)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows 10 x64 (build 18362)

#Matrix products: default

#locale:
#[1] LC_COLLATE=English_Hong Kong SAR.1252  LC_CTYPE=English_Hong Kong SAR.1252   
#[3] LC_MONETARY=English_Hong Kong SAR.1252 LC_NUMERIC=C                          
#[5] LC_TIME=English_Hong Kong SAR.1252    

#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base     

#loaded via a namespace (and not attached):
#[1] compiler_3.6.1   rsconnect_0.8.16 tools_3.6.1      tinytex_0.16     xfun_0.10       
#[6] packrat_0.5.0  

Sys.getlocale()
# "LC_COLLATE=English_Hong Kong SAR.1252;LC_CTYPE=English_Hong Kong SAR.1252;LC_MONETARY=English_Hong Kong SAR.1252;LC_NUMERIC=C;LC_TIME=English_Hong Kong SAR.1252"

Sys.getenv("LANG")
# "C.UTF-8"

Any ideas why I cannot load the txt file properly? By the way, I am able to type and print traditional Chinese in RStudio.

print("試試")
# [1] "試試"

```Sys.setlocale(category="LC_ALL",locale="chinese")``` did not work for me; it produces a different kind of garbled text: ```鐢樹綘璨峰骞惧寘鑺辩敓锛屽皬蹇冪啽姘``` — Calvin, Jun 01 '20 at 06:46
`encoding` seems to be case-sensitive. However, with `encoding='UTF-8'`, `print(x[2])` returns `<U+XXXX>`-style escapes and `,` instead of Chinese strings. Strings in my native locale are returned correctly, and `print("試試")` runs OK (my locale is `LC_CTYPE=Czech_Czechia.1250`). — JosefZ, Jun 01 '20 at 11:58

score 0 · Answer 1 · answered Jun 01 '20 at 14:02

Input file (added a line in my native locale):

100008251304976 Třiatřicet žlutých šišinek  2019-10-04 16:52:15
100008251304976 你又知喎    2019-10-04 16:52:15
100027970365477 甘你買多幾包花生，小心熱氣   2019-10-04 16:23:43

R code snippet (converting individual rows of the x data frame could be done in a loop, I know…):

sessionInfo()

library(stringi)
library(magrittr)

x <- read.table('d:\\bat\\R\\comment.txt', encoding = 'UTF-8', quote = "\"", fill = TRUE, sep = '\t')

print(x)

x['V2'][1,] %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8()
x['V2'][2,] %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8()
x['V2'][3,] %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8()

Result (after pasting the code snippet into an open RStudio console):

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Czech_Czechia.1250  LC_CTYPE=Czech_Czechia.1250    LC_MONETARY=Czech_Czechia.1250
[4] LC_NUMERIC=C                   LC_TIME=Czech_Czechia.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5  stringi_1.1.5

loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1   
> library(stringi)
> library(magrittr)
> 
> x <- read.table('d:\\bat\\R\\comment.txt', encoding = 'UTF-8', quote = "\"", fill = TRUE, sep = '\t')
> 
> print(x)
            V1                                                                                                V2
1 1.000083e+14                                                                        Třiatřicet žlutých šišinek
2 1.000083e+14                                                                  <U+4F60><U+53C8><U+77E5><U+558E>
3 1.000280e+14 <U+7518><U+4F60><U+8CB7><U+591A><U+5E7E><U+5305><U+82B1><U+751F>,<U+5C0F><U+5FC3><U+71B1><U+6C23>
                   V3
1 2019-10-04 16:52:15
2 2019-10-04 16:52:15
3 2019-10-04 16:23:43
> 
> x['V2'][1,] %>% 
+   stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
+   stri_unescape_unicode() %>% 
+   stri_enc_toutf8()
[1] "Třiatřicet žlutých šišinek"
> x['V2'][2,] %>% 
+   stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
+   stri_unescape_unicode() %>% 
+   stri_enc_toutf8()
[1] "你又知喎"
> x['V2'][3,] %>% 
+   stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
+   stri_unescape_unicode() %>% 
+   stri_enc_toutf8()
[1] "甘你買多幾包花生，小心熱氣"
>
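
The three pipelines above repeat the same steps row by row. Since the stringi functions involved accept whole character vectors, the conversion can probably be applied to the entire column in one call. The following is a minimal sketch, not part of the original answer: the helper name `unescape_codepoints` is hypothetical, and the `as.character()` call assumes the column may have been read as a factor (the default in R versions before 4.0).

```r
library(stringi)

# Hypothetical helper: converts every "<U+XXXX>" escape in a character
# vector back to the character it stands for. Same regex and replacement
# as in the per-row pipelines, applied to all elements at once.
unescape_codepoints <- function(v) {
  v <- as.character(v)  # a factor column is turned into plain strings first
  v <- stri_replace_all_regex(v, "<U\\+([[:alnum:]]+)>", "\\\\u$1")
  stri_enc_toutf8(stri_unescape_unicode(v))
}

# Demo input: one escaped string copied from the printed data frame,
# plus one string that contains no escapes at all.
demo <- c("<U+4F60><U+53C8><U+77E5><U+558E>", "plain ASCII")
converted <- unescape_codepoints(demo)
```

With the data frame from the answer, this would be used as `x$V2 <- unescape_codepoints(x$V2)`. One caveat: `stri_unescape_unicode()` interprets every backslash escape in the string, so text that already contains literal backslashes may be altered as well.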

This uses the accepted answer to a question about converting UTF-8 code point strings of the form `<U+XXXX>` to UTF-8.
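
A side note on the printed data frame: V1 shows up as 1.000083e+14 because read.table() guessed a numeric type for the ID column. A small sketch of one way to keep the IDs as text is to pass `colClasses = "character"`, which also prevents the text column from being converted to a factor. The sample file, its path and its contents below are made up for illustration, and the sketch makes no claim about how the Chinese text is displayed on a non-UTF-8 Windows locale.

```r
# Write a one-line UTF-8 sample file so the sketch is self-contained.
# "\uXXXX" escapes are used so the script itself stays ASCII-only.
sample_path <- tempfile(fileext = ".txt")
con <- file(sample_path, open = "wb")
writeLines(
  "100008251304976\t\u4f60\u53c8\u77e5\u558e\t2019-10-04 16:52:15",
  con, useBytes = TRUE  # write the UTF-8 bytes unchanged
)
close(con)

# colClasses = "character" is recycled to every column, so read.table()
# does no numeric or factor guessing for any of them.
x <- read.table(sample_path, sep = "\t", quote = "", fill = TRUE,
                encoding = "UTF-8", colClasses = "character")
```

Here `x$V1` holds the full ID as a string, so it is printed digit for digit rather than in scientific notation.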

Thanks, JosefZ. Since my dataset is very big, converting the encoding line by line was too time-consuming. I finally solved this by switching from Windows to a Mac, lol. Strangely, the encoding problem disappeared on macOS. Thanks anyway. — Calvin, Jun 08 '20 at 03:25
