I have a .csv file which should be in UTF-8 encoding. I exported it from SQL Server Management Studio. However, when importing it into R, the import fails on the lines containing ÿ. I use read.csv2 and specify fileEncoding = "UTF-8-BOM".

Notepad++ correctly displays the ÿ and reports the file as UTF-8. Is this a bug in R's encoding handling, or is ÿ in fact not representable in UTF-8?

I have uploaded a small tab-delimited .txt file that fails here: https://www.dropbox.com/s/i2d5yj8sv299bsu/TestData.txt

Thanks

Mace
  • ÿ is code 255 for ISO 8859-1. I suspect code has an EOF condition written to an 8-bit character. – chux - Reinstate Monica Feb 04 '14 at 15:50
  • In what way does R fail the importing? An error message of some sort or data gets cut off or transformed somehow? – LauriK Feb 04 '14 at 18:39
  • @LauriK No error message - just cuts off the import at the first line containing the letter. – Mace Feb 05 '14 at 08:06
  • Seems like what @chux said could be true. So you can either use some other R functions or if it's a one-time deal then replace the character in Notepad++ with something else and replace it back in R. – LauriK Feb 05 '14 at 08:11
  • Do you mean that the r code for `read.csv()` reads `ÿ` as EOF? I have tried using read.table and saving as tab delimited text files instead but I get the same problem. Do you have any suggestions for what function to use? – Mace Feb 05 '14 at 09:52
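The EOF suspicion raised in the comments can be illustrated outside of R. The point is that ÿ (U+00FF) is a perfectly valid UTF-8 character, but its single-byte Latin-1 encoding is 0xFF, which equals the traditional C `EOF` value (-1) when stored in a signed 8-bit char. A minimal sketch (in Python, purely for demonstration):

```python
# ÿ (U+00FF) is valid in both encodings, but the bytes differ:
# UTF-8 uses the two bytes C3 BF, while Latin-1 uses the single byte FF.
utf8_bytes = "ÿ".encode("utf-8")      # b'\xc3\xbf'
latin1_bytes = "ÿ".encode("latin-1")  # b'\xff'
print(utf8_bytes, latin1_bytes)

# 0xFF reinterpreted as a signed 8-bit integer is -1, the classic C EOF.
# A reader that compares characters against EOF in a signed char can
# therefore mistake a Latin-1 ÿ for end-of-file and stop reading.
signed = int.from_bytes(latin1_bytes, "big", signed=True)
print(signed)  # -1
```

So if the file (or one column of it) is actually Latin-1 rather than UTF-8, a naive 8-bit reader would stop exactly at the first ÿ, which matches the "import cuts off at that line" symptom.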

1 Answer


That is probably part of the byte-order mark (BOM) at the beginning of the file. If the editor or parser doesn't recognize the BOM, it treats those bytes as garbage characters. See https://www.ultraedit.com/support/tutorials-power-tips/ultraedit/unicode.html for more details.
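For reference, the UTF-8 BOM is the three bytes EF BB BF prepended to the file; a parser that doesn't know about it sees them as stray characters. A small sketch (in Python, just to show the bytes involved):

```python
# Simulate a file that starts with a UTF-8 BOM followed by header text.
bom = b"\xef\xbb\xbf"  # the UTF-8 byte-order mark
data = bom + "col1\tcol2\n".encode("utf-8")

# A BOM-aware decoder ("utf-8-sig") strips the marker transparently;
# a plain UTF-8 decoder keeps it as the invisible character U+FEFF.
print(data.decode("utf-8-sig").startswith("col1"))  # True
print(data.decode("utf-8").startswith("\ufeff"))    # True
```

This is why specifying a BOM-aware encoding (such as "UTF-8-BOM" in R's fileEncoding) matters: without it, the BOM bytes leak into the first field of the first row.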

aled