
Here is a snippet showing how my data is encoded in R memory. The CSV file was read with encoding "Latin-1" using data.table::fread. As the output below suggests, the data is stored with mixed encodings, which is undesirable because I keep the data in a SQLite database: whenever I send data to the database and read it back, the Latin-1 entries are not read in properly. Is there a way to normalize this? It seems that common functions like iconv won't work, since the data has multiple encodings in different parts of the data.frame.

Encoding(Data$DESC)

 [5305] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [5311] "unknown" "unknown" "unknown" "latin1"  "unknown" "unknown"
 [5317] "unknown" "latin1"  "latin1"  "latin1"  "latin1"  "unknown"
 [5323] "latin1"  "latin1"  "latin1"  "latin1"  "unknown" "latin1" 
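A minimal sketch of one possible normalization, assuming the goal is a single declared encoding (UTF-8) across all character columns. The sample data.frame here is illustrative, not the asker's `Data`; `to_utf8` is a hypothetical helper name.

```r
# Mimic the mixed state shown above: one ASCII entry, one latin1 entry.
df <- data.frame(DESC = c("plain ascii", "caf\xe9"), stringsAsFactors = FALSE)
Encoding(df$DESC) <- c("unknown", "latin1")

# Convert each character vector element-by-element based on its declared
# encoding, so latin1 and "unknown" entries are handled separately.
to_utf8 <- function(x) {
  if (!is.character(x)) return(x)
  enc <- Encoding(x)
  x[enc == "latin1"]  <- iconv(x[enc == "latin1"], from = "latin1", to = "UTF-8")
  x[enc == "unknown"] <- enc2utf8(x[enc == "unknown"])
  x
}

df[] <- lapply(df, to_utf8)
Encoding(df$DESC)  # latin1 entry is now declared "UTF-8";
                   # pure-ASCII entries stay "unknown", which is harmless
```

Note that pure-ASCII strings keep the declared encoding "unknown" even after conversion, since ASCII is valid in every encoding R supports; only entries with non-ASCII bytes get an explicit mark.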
Marie-Eve
    Please provide a reproducible example. And give your session info output including the version of data table. – Arun Feb 14 '16 at 15:17
  • What RDBS do use, may be you can set the encoding at the client the side http://stackoverflow.com/a/6477516/3338646 – huckfinn Feb 14 '16 at 15:27
  • 1
    I don't know who downvoted this question, but sometimes a proof of research effort or clarity is not just a matter of providing a reproducible example. I think this is a good question, and if you really need a dataset, try e.g., `df1 <- data.frame(matrix(letters[1:24],ncol=4),stringsAsFactors=FALSE)`. The command `sapply(df1,Encoding)` shows "unknown" for all entries. I'd be interested to see how the encoding of individual entries can be changed. – RHertel Feb 14 '16 at 18:04
  • That sounds like a bug in RSQlite - it should always convert to UTF-8 before sending to the db. – hadley Feb 17 '16 at 04:51
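Following the suggestion in the comments to set the encoding on the client side, here is a sketch of a UTF-8 round trip through SQLite using standard DBI/RSQLite calls. The data and table name are illustrative.

```r
library(DBI)

# Illustrative latin1 data, converted to UTF-8 before writing to the database.
df <- data.frame(DESC = c("caf\xe9", "th\xe9"), stringsAsFactors = FALSE)
Encoding(df$DESC) <- "latin1"
df$DESC <- iconv(df$DESC, from = "latin1", to = "UTF-8")

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mytable", df)
back <- dbReadTable(con, "mytable")
dbDisconnect(con)

identical(back$DESC, df$DESC)  # check that the UTF-8 text survives the round trip
```

SQLite itself stores text as UTF-8 by default, so converting once on the way in avoids the mixed-encoding problem on the way back out.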

0 Answers