I have a string called 'str' that I get from loading an RDS file.

This string contains French accents that display just fine in the RStudio console. However, when I use the ngram package on this string, the accented characters don't display correctly.

If I define an accented string directly in R it works just fine (see 'str2' in the code below).

How can I solve this, for example by forcing a new encoding on my original string?

library(ngram)
str # console displays "crédit hypothécaire en juillet"
ng <- ngram(str, n = 2, sep = " ")
get.phrasetable(ng) # the accented characters don't display right in this table
# ngrams freq      prop
# 1      hypothécaire en     1 0.3333333
# 2 crédit hypothécaire     1 0.3333333
# 3            en juillet     1 0.3333333
str2 <- "crédit hypothécaire en juillet"
ng2 <- ngram(str2, n = 2, sep = " ")
get.phrasetable(ng2)
# ngrams freq      prop
# 1     hypothécaire en     1 0.3333333
# 2 crédit hypothécaire     1 0.3333333
# 3          en juillet     1 0.3333333
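
Before forcing a new encoding, it can help to check how R currently labels the string and which bytes it actually holds. The following is only a sketch: `s` is a hypothetical stand-in for the string loaded from the RDS file, built here with Unicode escapes so that it is stored as UTF-8.

```r
# Hypothetical stand-in for the string loaded from the RDS file
s <- "cr\u00e9dit hypoth\u00e9caire en juillet"

enc <- Encoding(s)                 # declared encoding: "UTF-8", "latin1", "bytes" or "unknown"
is_utf8 <- validUTF8(s)            # TRUE when the bytes form valid UTF-8
e_bytes <- charToRaw("\u00e9")     # bytes of a UTF-8 "é": c3 a9

print(enc)
print(is_utf8)
print(e_bytes)
```

Running the same three calls on 'str' and on 'str2' should show whether the two strings differ in their declared encoding, in their bytes, or in both.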

EDIT:

The accepted answer in the suggested link (handling special characters e.g. accents in R) didn't solve my issue, so this is not a duplicate question. It did provide some clues, though; see the answer below.

moodymudskipper

1 Answer

Following the link that @ErikSchutte posted in a comment on the question, I found what I needed. It's not a duplicate, however, as the accepted answer there didn't work for me.

I'll post what worked, but I don't understand why it works, so I won't accept my own answer. I'll accept a better one if it comes.

From 'handling special characters e.g. accents in R' I took the following ideas:

Encoding(str) <- "UTF-8"
Encoding(str) <- "LATIN1"
str <- iconv(str, from="UTF-8", to="LATIN1")
str <- iconv(str, from="LATIN1", to="UTF-8")
enc2utf8(as(str, "character"))

One (and only one) of them worked for me, this one:

str <- iconv(str, from="UTF-8", to="LATIN1")
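
The ideas above fall into two families, and the difference may be part of why only one of them had an effect: `Encoding<-` only changes the label R attaches to the bytes, while `iconv()` rewrites the bytes themselves. The sketch below illustrates this on a single "é". It assumes a UTF-8 input and says nothing about what the ngram package does internally.

```r
x <- "\u00e9"                         # "é" stored as UTF-8: bytes c3 a9

relabelled <- x
Encoding(relabelled) <- "latin1"      # same bytes, only the declared encoding changes

converted <- iconv(x, from = "UTF-8", to = "latin1")  # bytes are rewritten

raw_relabelled <- charToRaw(relabelled)  # still c3 a9
raw_converted  <- charToRaw(converted)   # e9, the latin1 byte for "é"

print(raw_relabelled)
print(raw_converted)
```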

EDIT:

This line works well when you know your string is not encoded correctly, but it turns the string into NA if it was already encoded correctly. Here is my inelegant workaround, which falls back to the original value whenever the conversion returns NA:

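# str_arr: a string or a character vector of strings
encode_to_latin1 <- function(str_arr) {
  # Treat the input as UTF-8 and convert it to latin1
  str_arr_converted <- iconv(str_arr, from = "UTF-8", to = "LATIN1")
  # iconv() returns NA for elements it cannot convert; keep the originals there
  nas <- is.na(str_arr_converted)
  str_arr_converted[nas] <- str_arr[nas]
  str_arr_converted
}
str_arr <- encode_to_latin1(str_arr)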
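
As a quick check of the fallback, here is a small usage sketch on a mixed vector. The two input strings are hypothetical: one is built as UTF-8 and the other from raw latin1 bytes. The function is repeated so that the snippet runs on its own.

```r
encode_to_latin1 <- function(str_arr) {
  str_arr_converted <- iconv(str_arr, from = "UTF-8", to = "LATIN1")
  nas <- is.na(str_arr_converted)
  str_arr_converted[nas] <- str_arr[nas]
  str_arr_converted
}

# "crédit" stored as UTF-8 (é = bytes c3 a9)
utf8_str <- "cr\u00e9dit"

# "crédit" stored as latin1 (é = byte e9), built from raw bytes
latin1_str <- rawToChar(as.raw(c(0x63, 0x72, 0xe9, 0x64, 0x69, 0x74)))
Encoding(latin1_str) <- "latin1"

out <- encode_to_latin1(c(utf8_str, latin1_str))

# The first element is converted; the second is not valid UTF-8,
# so iconv() gives NA and the original latin1 string is kept.
print(lapply(out, charToRaw))
```

Note that the fallback cannot tell "already latin1" apart from any other conversion failure, so an element that fails for a different reason is also passed through unchanged.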