
I have character strings that look completely normal when printed to the RStudio console, but that appear as strange characters when written to CSV and opened in Excel.

Reproducible example

The following unserializes an object that prints as the string "a wit", then writes it to a csv:

# install.packages("dplyr")
library(dplyr)

serialized_char <- "580a000000030003060200030500000000055554462d380000001000000001000080090000000661c2a0776974"

# split the hex dump into two-character byte pairs, convert to raw, and unserialize
(string <- serialized_char %>% 
    {substring(., seq(1, nchar(.), 2), seq(2, nchar(.), 2))} %>% 
    paste0("0x", .) %>% 
    as.integer %>% 
    as.raw %>% 
    unserialize())
[1] "a wit"

write.csv(string, "myfile.csv", row.names=F)
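One way to see what the string actually contains is to inspect its raw bytes (continuing from the transcript above; equivalently, `string <- "a\ua0wit"`). The "space" between "a" and "wit" is the two-byte UTF-8 sequence `c2 a0` (a non-breaking space, code point 160), not an ordinary space (`20`):

```r
string <- "a\ua0wit"  # same value the unserialize() call above produces

charToRaw(string)
# [1] 61 c2 a0 77 69 74

utf8ToInt(string)
# [1]  97 160 119 105 116
```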

This is what it looks like when written from Mojave (and viewed in Excel on macOS Mojave) - it contains undesirable characters:

[screenshot]

This is when it's written in High Sierra (and viewed in Excel on High Sierra) - it also contains undesirable characters:

[screenshot]

This is when it's written from Windows 10 (and viewed in Excel on Windows 10) - looks good!:

[screenshot]

This is when it's written from Mojave but viewed in Excel on Windows 10 - it still contains undesirable characters:

[screenshot]

Question

I have a lot of character data of the form above (i.e. with characters that look strange when written to CSV and opened in Excel). How can these strings be cleaned so that the text appears normally in Excel?

What I've tried

I have tried four things so far:

write.csv(string, "myfile.csv", fileEncoding = 'UTF-8')

Encoding(string) <- "latin1"

Encoding(string) <- "UTF-8"

iconv(string, "UTF-8", "latin1", sub=NA)

stevec

1 Answer


The problem isn’t R, the problem is Excel.

Excel has its own ideas about what a platform's character encoding should be. Notably, it insists, even on modern versions of macOS, that the platform encoding is Mac Roman rather than the actually prevailing UTF-8.

The file is correctly written as UTF-8 on macOS by default.

To get Excel to read it correctly, you need to choose "File" › "Import…" and from there follow the import wizard, which lets you specify the file encoding.
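For completeness: if the CSV must open correctly on double-click, the BOM-based workaround (which, as discussed in the comments, is widely considered bad practice) can be done with readr's `write_excel_csv()`, assuming the readr package is installed. It prepends a UTF-8 byte-order mark, which hints to Excel that the file is UTF-8 rather than the platform encoding:

```r
# install.packages("readr")
library(readr)

# write_excel_csv() expects a data frame and writes a UTF-8 BOM
# (bytes ef bb bf) at the start of the file
write_excel_csv(data.frame(x = "a\ua0wit"), "myfile.csv")
```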

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • Do you know how to treat the string in R such that excel will read it as desired? I've tried everything I know – stevec May 13 '20 at 14:28
  • @stevec There is no *good* way. You could add a UTF-8 BOM but absolutely *everybody* (including Microsoft!) agrees that this is a terrible practice. The technically best way would probably be to use Excel automation to write an XLSX file. – Konrad Rudolph May 13 '20 at 14:29
  • BOM is quite old school I thought? – stevec May 13 '20 at 14:30
  • Right. BOM is old-school because it’s a bad solution. But then this behaviour of Excel is also bad, and arguably a bug. – Konrad Rudolph May 13 '20 at 14:30
  • Great knowledge. Do you know if there's a way to coerce the `"a wit"` seen above such that `identical(a_wit_from_above, "a wit")` is `TRUE` ? – stevec May 13 '20 at 14:32
  • I think there must be some way to somehow coerce the "a wit" in the question to the literal "a wit", which would see it write as desired. I've tried all the conversion methods and encoding tricks i know though – stevec May 13 '20 at 14:38
  • @stevec Actually `stringi::stri_trans_nfkc` might be what you’re after: it translates the non-breaking space to a regular space. However, you’ll still have to handle diacritics after that, if so desired. – Konrad Rudolph May 13 '20 at 15:01
  • 1
    @stevec This handles diacritics as well as the non-breaking space: `iconv(stringi::stri_trans_nfkd(x), 'UTF-8', 'ASCII', sub='')`; with this, `'ä\ua0test'` becomes `'a test'`. – Konrad Rudolph May 13 '20 at 15:06
  • I tried it on the MRE, but also on the much larger example I had (17MB of all sorts of characters) - it works! – stevec May 13 '20 at 15:12
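The cleanup one-liner from the comments, expanded into a minimal, self-contained sketch (assumes the stringi package is installed; `"a\ua0wit"` stands in for the unserialized string in the question):

```r
# install.packages("stringi")
library(stringi)

x <- "a\ua0wit"   # "a", non-breaking space (U+00A0), "wit"

# NFKD-normalise (turns the non-breaking space into a plain space and
# splits accented letters into base letter + combining mark), then
# drop any remaining non-ASCII bytes
cleaned <- iconv(stri_trans_nfkd(x), "UTF-8", "ASCII", sub = "")

identical(cleaned, "a wit")
# [1] TRUE
```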