8

There is a strange behavior of stringr that is really annoying me: stringr changes the encoding of some strings without any warning, namely strings that contain exotic characters, in my case ø, å, æ, é and some others... If you str_trim a character vector, the elements with exotic letters are converted to a new encoding.

library(stringr)

letter1 <- readline('Gimme an ASCII character!')    # try q or a
letter2 <- readline('Gimme a non-ASCII character!') # try ø or é
Letters <- c(letter1, letter2)
Encoding(Letters)           # 'unknown' 'unknown'
Encoding(str_trim(Letters)) # mixed: 'unknown' and 'UTF-8'

This is a problem because I use data.table for (fast) merges of big tables, data.table does not support mixed encodings, and I could not find a way to get back to a uniform encoding.

Any work-around?

EDIT: I thought I could fall back to the base functions, but they don't preserve the encoding either. paste conserves it, but sub, for instance, does not.

Encoding(paste(' ', Letters))                 # 'unknown'
Encoding(str_c(' ', Letters))                 # mixed
Encoding(sub('^ +', '', paste(' ', Letters))) # mixed
Arthur
  • 1,208
  • 13
  • 25
  • 1
    I see it mixed for `letters` as well. By the way, `letters` is a built-in constant in base R. – Frank Nov 02 '15 at 16:35
  • 'unknown' is the local encoding, if I understood correctly, so it may depend on the machine, I guess... I changed letters to Letters, since you seemed annoyed that I overwrote a constant. – Arthur Nov 02 '15 at 16:38
  • If you have a way to create 'unknown' encoding on any machine, please share! – Arthur Nov 02 '15 at 16:42
  • 1
    You can do `Encoding(Letters) = ''` to clear the encoding. But that’s not a very satisfactory solution. – Konrad Rudolph Nov 02 '15 at 16:57
  • It seems to work, but I am not sure it works all the time. I think I had situations where the `Encoding<-` function would not change the actual encoding, but I cannot find any example right now. – Arthur Nov 02 '15 at 17:01
  • I'm no expert on encodings, I'm just saying that on my machine: `Encoding(c("a","ø")) # [1] "unknown" "latin1"`, in contrast with the "unknown" on yours. I'm not "annoyed" about your variable naming :) just letting you know about it. – Frank Nov 02 '15 at 17:47
  • Got it. And I knew about `letters`, but `letters` does not contain any exotic characters. – Arthur Nov 02 '15 at 18:26
  • @KonradRudolph Can you put that as an answer? It works for now, even if it does not remove the question of why string manipulation functions change the encoding... – Arthur Nov 06 '15 at 10:58

3 Answers

3

stringr is changing the encoding because stringr is a wrapper around the stringi package, and stringi always encodes in UTF-8. See help("stringi-encoding", package = "stringi") for details and an explanation of this design choice.

To avoid problems with merging data.tables, just make sure all the id variable(s) are encoded in UTF-8. You can do that using stri_enc_toutf8 in the stringi package, or using iconv.
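
For example, something along these lines (a minimal sketch; the table and column names are made up for illustration):

library(data.table)
library(stringi)

# hypothetical tables to be merged on the character column 'id'
dt1 <- data.table(id = c('ø', 'å', 'a'), x = 1:3)
dt2 <- data.table(id = c('ø', 'å', 'a'), y = 4:6)

# convert (not just re-mark) the key column to UTF-8 in both tables
dt1[, id := stri_enc_toutf8(id)]
dt2[, id := stri_enc_toutf8(id)]
# alternatively: dt1[, id := iconv(id, to = 'UTF-8')]

merge(dt1, dt2, by = 'id')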

Ista
  • 10,139
  • 2
  • 37
  • 38
  • Using `iconv` apparently gives mixed encoding: `Encoding(iconv(Letters, to='UTF-8'))`; `stri_enc_toutf8` surprisingly gives a uniformly native-encoded vector...: `Encoding(stringi:::stri_enc_toutf8(Letters))`; what I don't get is why **mixed** encoding? – Arthur Nov 04 '15 at 08:33
  • It probably isn't really mixed; this is just a quirk of the `Encoding` function. In particular, ASCII characters are never marked with an encoding; see `?Encoding` for details. – Ista Nov 04 '15 at 12:27
  • Thank you for the explanation. But then, if mixed encoding is just the default, why is `data.table` complaining about it? And why does it handle it so poorly? Except for a few hundred million anglophones, everyone else uses non-ASCII characters. I mean: the unwanted consequence of using `stringr` functions is that I get UTF-8 for non-ASCII characters; but I **do get** wrong results in `data.frame` when merging on those. – Arthur Nov 04 '15 at 15:47
  • @Arthur I think the `merge.data.table` warning is pretty clear. As explained there the merge should be done correctly if the only non-marked characters are ASCII. If that's not the case you should open a bug report against data.table, or at least post a SO question that includes a reproducible example of the data.table merge errors you are seeing. – Ista Nov 04 '15 at 16:23
  • Completely agree. I don't incriminate `data.table`, which works as intended. However, if I use string manipulation functions on the fly inside `data.table`, I get wrong merges without a warning, since I get two encodings without notice: the local one and UTF-8. – Arthur Nov 06 '15 at 10:56
  • @Arthur No, you didn't read my comment carefully. You should _not_ get wrong merges. If you do you should ask a question about it here on SO or open a bug report. – Ista Nov 06 '15 at 12:27
  • ``DT <- data.table(letters=sample(c('a', 'ø'), 10, TRUE) %>% `Encoding<-`(''), numbers=1:10); DT2 <- DT[,list(.N), keyby=list(letters2=str_replace(letters, 'ø', 'å'))]['å' %>% `Encoding<-`('')]; Encoding(DT2$letters2)`` seems to work. If I find an example, I'll come back. – Arthur Nov 06 '15 at 12:47
2

With this recent commit, data.table now takes care of these mixed encodings implicitly, by ensuring proper encodings when creating data.tables as well as in functions like unique() and duplicated().

See news item (23) under bugs for v1.9.7 in README.md.
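
As a rough check, here is a minimal sketch, assuming a data.table build that includes that fix (v1.9.7 devel or later); the exact encoding marks depend on your locale:

library(data.table)
library(stringr)

x <- c('ø', str_trim(' ø '))  # same text, potentially different encoding marks
Encoding(x)                   # varies by platform, e.g. 'latin1' 'UTF-8' or 'unknown' 'UTF-8'

DT <- data.table(id = x)
unique(DT)      # expected: a single row
duplicated(DT)  # expected: FALSE TRUE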

Please test and write back if you face any further issues.

Arun
  • 116,683
  • 26
  • 284
  • 387
1

R doesn’t always make it easy to convert between encodings (there’s the iconv function for that, but what it accepts is platform dependent). However, at the very least you can always reset the encoding mark of a string to “unknown”:

Letters = str_trim(Letters)
Encoding(Letters)
# [1] "unknown" "UTF-8"
Encoding(Letters) = ''
Encoding(Letters)
# [1] "unknown" "unknown"

However, note that this only marks the encoding of a string; it doesn’t actually re-encode the string. As a consequence, it can lead to garbled data. As mentioned in the comments, this is at best a hack, not an actual fix for the problem.

Encoding exemplifies R’s trouble working properly with encodings. The documentation says:

ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings.

… which is obviously not helpful at all (and also more than a bit misleading; a UTF-8 string consisting only of code points < 128 may look indistinguishable from an ASCII string, but operating on it should yield different results depending on the encoding, which is why it should effectively be marked).

Interestingly, neither enc2native nor enc2utf8 will do the desired thing here: both will yield different encodings for the two strings in Letters, a direct consequence of the Encoding problem cited above.
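
For illustration, a quick check of that claim (a sketch; the exact marks depend on your locale):

Letters <- c('a', 'ø')
Encoding(enc2utf8(Letters))   # e.g. 'unknown' 'UTF-8'   (ASCII is never marked)
Encoding(enc2native(Letters)) # e.g. 'unknown' 'latin1' on a latin1 locale,
                              #      'unknown' 'UTF-8'  on a UTF-8 locale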

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • 2
    The approach suggested in this answer is not a good one. For one thing this will not work at all on Windows (or possibly any platform where the default encoding is not UTF8). For another it doesn't distinguish clearly between _changing_ the encoding (e.g., with `iconv`) and _marking_ the encoding (with `Encoding`). More generally, the problem it tries to solve doesn't actually exist; the strings returned by `str_trim` and other functions in the stringr package are perfectly fine, and the elements _do not have different encodings_. As documented, `Encoding` doesn't mark ascii characters. – Ista Nov 07 '15 at 02:48
  • @Ista You say that the different encodings don't pose a problem but if you look at the question, that's simply not true. Furthermore, I'm intentionally only marking the encoding and not converting since, as you've noticed, encoding conversion in R is not supported to the same extent on all platforms. Now, I entirely agree with you that this isn't a proper solution, which is why I posted precisely that as a comment rather than an answer, until the OP asked me to post an answer as well. – Konrad Rudolph Nov 07 '15 at 11:27
  • There _are no different encodings_. All elements of the character vector returned by `str_trim` are encoded in UTF-8. Try running this answer on Windows (print `Letters` afterward) and you'll see why stripping the encoding marker is a bad idea. Try actually merging some `data.table` objects on UTF-8 encoded strings (without stripping the `Encoding`) and you'll see that it works correctly, just as the warning message says it will. I know why you posted this answer, but for the sake of others coming across this question in the future please edit to make clear that this is asking for trouble. – Ista Nov 07 '15 at 14:27
  • @Ista So you're saying that contrary to what the OP said, data.table accepts this? Well in that case this answer is obviously useless. – Konrad Rudolph Nov 07 '15 at 22:04
  • `merge.data.table` gives a warning but merges correctly. The text of the warning indicates that there is no problem under the conditions being discussed here. – Ista Nov 07 '15 at 22:09
  • @Ista Sigh. Thanks, I'll change the answer later. – Konrad Rudolph Nov 07 '15 at 22:09
  • Is there any chance that the ASCII behaviour will be changed in R? Thanks to your answer, I understood my problem [here](https://stackoverflow.com/q/45028581/5784831), but it feels like I'm starting with a hack... – Christoph Jul 11 '17 at 07:58