I have a .csv file which should be in UTF-8 encoding. I exported it from SQL Server Management Studio. However, when importing it into R, the import fails on the lines containing ÿ. I use read.csv2 and specify fileEncoding = "UTF-8-BOM".

Notepad++ correctly displays the ÿ and reports the file as UTF-8. Is this a bug in R's encoding handling, or is ÿ in fact not representable in UTF-8?

I have uploaded a small tab-delimited .txt file that fails here: https://www.dropbox.com/s/i2d5yj8sv299bsu/TestData.txt

Thanks

Mace
  • ÿ is code 255 for ISO 8859-1. I suspect code has an EOF condition written to an 8-bit character. – chux - Reinstate Monica Feb 04 '14 at 15:50
  • In what way does R fail the importing? An error message of some sort or data gets cut off or transformed somehow? – LauriK Feb 04 '14 at 18:39
  • @LauriK No error message - just cuts off the import at the first line containing the letter. – Mace Feb 05 '14 at 08:06
  • Seems like what @chux said could be true. So you can either use some other R functions or if it's a one-time deal then replace the character in Notepad++ with something else and replace it back in R. – LauriK Feb 05 '14 at 08:11
  • Do you mean that the r code for `read.csv()` reads `ÿ` as EOF? I have tried using read.table and saving as tab delimited text files instead but I get the same problem. Do you have any suggestions for what function to use? – Mace Feb 05 '14 at 09:52
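The EOF suspicion raised in the comments can be illustrated outside of R. The point is that ÿ (U+00FF) is a perfectly valid UTF-8 character, but its single-byte Latin-1 encoding is 0xFF, which equals the traditional C `EOF` value (-1) when stored in a signed 8-bit char. A minimal sketch (in Python, purely for demonstration):

```python
# ÿ (U+00FF) is valid in both encodings, but the bytes differ:
# UTF-8 uses the two bytes C3 BF, while Latin-1 uses the single byte FF.
utf8_bytes = "ÿ".encode("utf-8")      # b'\xc3\xbf'
latin1_bytes = "ÿ".encode("latin-1")  # b'\xff'
print(utf8_bytes, latin1_bytes)

# 0xFF reinterpreted as a signed 8-bit integer is -1, the classic C EOF.
# A reader that compares characters against EOF in a signed char can
# therefore mistake a Latin-1 ÿ for end-of-file and stop reading.
signed = int.from_bytes(latin1_bytes, "big", signed=True)
print(signed)  # -1
```

So if the file (or one column of it) is actually Latin-1 rather than UTF-8, a naive 8-bit reader would stop exactly at the first ÿ, which matches the "import cuts off at that line" symptom.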

1 Answer


That is probably part of the byte-order mark (BOM) at the beginning of the file. If the editor or parser doesn't recognize the BOM, it treats those bytes as garbage characters. See https://www.ultraedit.com/support/tutorials-power-tips/ultraedit/unicode.html for more details.
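For reference, the UTF-8 BOM is the three bytes EF BB BF prepended to the file; a parser that doesn't know about it sees them as stray characters. A small sketch (in Python, just to show the bytes involved):

```python
# Simulate a file that starts with a UTF-8 BOM followed by header text.
bom = b"\xef\xbb\xbf"  # the UTF-8 byte-order mark
data = bom + "col1\tcol2\n".encode("utf-8")

# A BOM-aware decoder ("utf-8-sig") strips the marker transparently;
# a plain UTF-8 decoder keeps it as the invisible character U+FEFF.
print(data.decode("utf-8-sig").startswith("col1"))  # True
print(data.decode("utf-8").startswith("\ufeff"))    # True
```

This is why specifying a BOM-aware encoding (such as "UTF-8-BOM" in R's fileEncoding) matters: without it, the BOM bytes leak into the first field of the first row.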

aled