
I have a string vector, in which some values are in Vietnamese, written in UTF-8 encoding.

> so_wrong
 [1] "Thiết bị & dịch vụ"     "Quản lý"               
 [3] "Hãng"                   "Thời tiết"             
 [5] "Lý do khác"             "Tàu bay về muộn"       
 [7] "Kỹ thuật"               "Thương mại"            
 [9] "Khai thác"              "Quản lý, điều hành bay"
[11] " "                     

I want to remove the values that appear in another vector, which contains the last two values: "Quản lý, điều hành bay" and " ". But R does not recognize them.

> any(so_wrong == " ")
[1] FALSE
> any(so_wrong == "Quản lý, điều hành bay")
[1] FALSE

...even though the values entered in these commands are exactly the values in the vector (I copy-pasted them in). This, on the other hand, works:

> any(so_wrong == so_wrong[11])
[1] TRUE

What is the problem, and how can I solve or work around it?

EDIT: The encoding

> Encoding(so_wrong)
 [1] "UTF-8"  "UTF-8"  "latin1" "UTF-8"  "UTF-8"  "UTF-8"  "UTF-8" 
 [8] "UTF-8"  "latin1" "UTF-8"  "UTF-8" 
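
As a side check, the mixed "latin1"/"UTF-8" marks can be made uniform with `enc2utf8()`. The sketch below uses a hypothetical one-element vector `x` (built from raw bytes so it does not depend on the console encoding) rather than the real `so_wrong`. It suggests that a differing encoding mark does not, by itself, make `==` fail, because R translates both sides before comparing.

```r
# Hypothetical example: "Hãng" stored as Latin-1 bytes (0xe3 is "ã" in Latin-1).
x <- rawToChar(as.raw(c(0x48, 0xe3, 0x6e, 0x67)))
Encoding(x) <- "latin1"

# Re-encode to UTF-8 so every element carries the same mark.
y <- enc2utf8(x)

Encoding(x)    # "latin1"
Encoding(y)    # "UTF-8"
charToRaw(y)   # 48 c3 a3 6e 67

# The comparison still succeeds although the marks differ.
x == y         # TRUE
```

On the real data this would be `so_wrong <- enc2utf8(so_wrong)`.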

EDIT: I saved the vector to a csv and pushed it here

  • Use charToRaw to check the raw bytes. There might be more than one space in the last element, and extra spaces around the 10th element that are not showing up. Also use trimws to strip whitespace. – infominer Dec 28 '16 at 20:34
  • Using charToRaw on the " " value gives me the result c2 a0 . What should I learn from this? – Hiếu Phẩy Dec 28 '16 at 20:48
  • You can also see whether there are any non ascii characters in an object with this command using system call from R to octal dump: `system(sprintf("echo %s | od -c", so_wrong[11]))` – Serhat Cevikel Dec 28 '16 at 21:01
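
The bytes `c2 a0` reported in the comment above are the UTF-8 encoding of U+00A0, the no-break space, which is a different character from the ordinary space (byte `20`). That would explain why `so_wrong == " "` is FALSE. The sketch below rebuilds the situation with a hypothetical vector `x`, not the real data. Whether the same character also occurs inside "Quản lý, điều hành bay" is an assumption that `utf8ToInt()` on the real element can confirm or rule out.

```r
# No-break space built from its code point; intToUtf8() returns a UTF-8 string.
nbsp <- intToUtf8(160)
charToRaw(nbsp)        # c2 a0
nbsp == " "            # FALSE

# Hypothetical stand-in for the affected elements of so_wrong.
x <- c(paste0("a", nbsp, "b"), nbsp)

# List the code points to spot the odd character (160 instead of 32).
utf8ToInt(x[1])        # 97 160 98

# Replace every no-break space with an ordinary space, then compare again.
cleaned <- gsub(nbsp, " ", x, fixed = TRUE)
cleaned == c("a b", " ")   # TRUE TRUE
```

Note that `trimws()` by default strips only spaces, tabs, carriage returns and newlines, so it would not remove this character by itself.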

1 Answer


I copied that problematic string ("Quản lý, điều hành bay") to R, assigned to an object, checked for logical equality and it was OK.

> so_wrong <- "Quản lý, điều hành bay"
> so_wrong == "Quản lý, điều hành bay"
[1] TRUE

I think the problem is with your encoding options. You can try two things:

  • Set the encoding option to utf-8 explicitly:

    options(encoding="utf-8")

By the way, my encoding option is "native.enc":

> getOption("encoding")
[1] "native.enc"

You can also give that a try.

  • You can set the encoding of the input, if you read from a file. From the man page of read.table:

    read.table(file, header = FALSE, sep = "", quote = "\"'",
               dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
               row.names, col.names, as.is = !stringsAsFactors,
               na.strings = "NA", colClasses = NA, nrows = -1,
               skip = 0, check.names = TRUE, fill = !blank.lines.skip,
               strip.white = FALSE, blank.lines.skip = TRUE,
               comment.char = "#",
               allowEscapes = FALSE, flush = FALSE,
               stringsAsFactors = default.stringsAsFactors(),
               fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

So you can set the encoding explicitly to "utf-8" in read.table.
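
A minimal sketch of that call, assuming a hypothetical one-column file with a header named `reason` (the file, the column name and the values are made up for illustration). The signature has two related arguments: `encoding` only marks the strings that are read as UTF-8 without converting them, while `fileEncoding` declares the encoding of the file so that it is re-encoded on input. The sketch uses `encoding`.

```r
# Hypothetical data: "Hãng" built from code points, plus a plain ASCII value.
x <- c(intToUtf8(c(72, 227, 110, 103)), "Quan ly")

# Write the UTF-8 bytes to a temporary file unchanged.
tmp <- tempfile(fileext = ".csv")
writeLines(c("reason", x), tmp, useBytes = TRUE)

# Read the file back and mark the non-ASCII strings as UTF-8.
df <- read.table(tmp, header = TRUE, sep = ",",
                 encoding = "UTF-8", stringsAsFactors = FALSE)

Encoding(df$reason)   # shows how each value is marked
df$reason == x        # the values compare equal to the originals

unlink(tmp)
```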

Serhat Cevikel
  • I have switched the encoding between UTF-8 and native.enc (which I believe is the default encoding in R?), but the result is still the same. I read this from a docx file with a function that I wrote, based on this [link](https://rud.is/b/2015/08/23/using-r-to-get-data-out-of-word-docs/). But I don't think my function is the problem, as it has nothing to do with the values themselves. – Hiếu Phẩy Dec 28 '16 at 20:45
  • Could you please execute the Encoding(x) function where x is the object containing the string? What is the output? I am checking the encoding options of xml2 package by the way. – Serhat Cevikel Dec 28 '16 at 20:50
  • From the man page of read_xml: read_xml(x, encoding = "", ..., as_html = FALSE, options = "NOBLANKS"). Did you set the encoding in your function if you used that? – Serhat Cevikel Dec 28 '16 at 20:52
  • I have edited the question with the output you asked for and a link to a csv of my data. – Hiếu Phẩy Dec 28 '16 at 20:58
  • And what is the output of `getOption("encoding")`? – Serhat Cevikel Dec 28 '16 at 21:02
  • I have also added `encoding = "UTF-8"` to read_xml() but the problem remains. – Hiếu Phẩy Dec 28 '16 at 21:02
  • `getOption("encoding")` returns `"utf-8"` as of now. – Hiếu Phẩy Dec 28 '16 at 21:04
  • This post http://stackoverflow.com/questions/23699271/force-character-vector-encoding-from-unknown-to-utf-8-in-r explains stringi package functions such as `stri_encode`, and also the `file` function and this man page https://stat.ethz.ch/R-manual/R-devel/library/base/html/iconv.html explains `iconv` function from base package. I'm trying to apply the ideas to your csv file. But I also see that - from the octal dump - there are items with non UTF-8 character such as the second string. – Serhat Cevikel Dec 28 '16 at 21:40
  • This old problem has been resolved since I moved from Windows to a Linux machine. R on Windows has problems with Latin Extended characters, and it is acknowledged in our Vietnamese statistics community that moving to a Linux machine is currently the only solution. I accept your answer as the most helpful answer thus far. Thank you. – Hiếu Phẩy May 28 '18 at 09:28