
I am processing SPSS data from a questionnaire that must have originated in M$ Word. Word automatically changes hyphens into long hyphens, and these get converted into characters that don't display properly, e.g. "-" turns into "ú".

My question: What is the equivalent to utf8ToInt() in the WINDOWS-1252 character set?

utf8ToInt("A")
[1] 65

When I do this with my own data, I get an error:

x <- str_sub(levels(sd$j1)[1], 7, 7)
print(x)
[1] "ú"

utf8ToInt(x)
Error in utf8ToInt(x) : invalid UTF-8 string

However, the contents of x are perfectly usable in grep and gsub expressions.

> Sys.getlocale()
[1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"
  • Good luck trying to do anything with Unicode text in R for Windows! – David Heffernan Mar 05 '11 at 16:46
  • Your problem is that "ú" is not encoded as UTF-8. It's actually "\xfa" and is encoded in what R calls latin1 but is really 1252, I imagine. Windows has a fundamentally different way of handling Unicode text and I don't believe that R on Windows makes any effort to do it the Windows way. I offer you sympathy but I can't offer more! – David Heffernan Mar 05 '11 at 16:51
  • @David Heffernan, thank you very much for both the insight and the sympathy! Still, is there a way to determine the internal representation of this latin1 character, something like latin1ToInt() – which I haven't yet found? And, vice versa, if I know the latin1 code is 150, how do I generate the actual character, again something like intToLatin1()? – Andrie Mar 07 '11 at 09:21
  • @Andrie I'm very sorry, but I've got nothing really to offer. I find the R documentation less than helpful and I have this strong fear that it has been developed for UNIX based systems which use byte streams and locales and no consideration has been made to the UTF-16 based Unicode of Windows. – David Heffernan Mar 07 '11 at 09:28
  • @David Thank you. My code works for now, and I don't *really* need to know the codes. To reuse this work in future projects, a clumsy workaround would be to create a hash table of non-printing characters and their ASCII equivalents, then save() and load() it for future projects. Many thanks for your help. – Andrie Mar 07 '11 at 09:45
  • @David Heffernan, you might be interested in the solution I concocted and posted as an answer here. – Andrie Mar 12 '11 at 11:39
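Building on the comment above, a minimal sketch that confirms what byte the character actually carries (here "\xfa" stands in for the character extracted from the data):

```r
x <- "\xfa"                  # the single CP1252/latin1 byte that displays as "ú"
as.integer(charToRaw(x))     # 250 -- the raw byte value, no UTF-8 involved
# utf8ToInt() works once the string has been converted to UTF-8 first:
utf8ToInt(iconv(x, "CP1252", "UTF-8"))   # also 250
```

This is why utf8ToInt() fails on the raw string: the byte 0xFA is not valid UTF-8 on its own, but it is a perfectly good CP1252 code.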

3 Answers


If you load the SPSS sav file via read.spss from package foreign, you can easily import the data frame with the correct encoding by specifying it, like:

read.spss("foo.sav", reencode="CP1252")
daroczig
  • I was very excited to read your response and tried this immediately on my data. Sadly, this didn't have the desired effect and I still get characters that don't display properly. I have also looped through the entire iconvlist() but still no luck. But thank you for a very helpful answer. – Andrie Mar 07 '11 at 09:24

After some head-scratching, lots of reading help files and trial and error, I created two little functions that do what I need. These functions work by converting their input into UTF-8 and then returning the integer vector for the UTF-8 encoded character vector, and vice versa.

# Convert character to integer vector
# Optional encoding specifies encoding of x, defaults to current locale
encToInt <- function(x, encoding=localeToCharset()){
    utf8ToInt(iconv(x, encoding, "UTF-8"))
}

# Convert integer vector to character vector
# Optional encoding specifies encoding of x, defaults to current locale
intToEnc <- function(x, encoding=localeToCharset()){
    iconv(intToUtf8(x), "UTF-8", encoding)
}

Some examples:

x <- "\xfa"
encToInt(x)
[1] 250

intToEnc(250)
[1] "ú"
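A round-trip check using these two functions (passing the encoding explicitly, so the result does not depend on the current locale; this assumes iconv on your platform accepts the name "CP1252"):

```r
x <- "\xfa"                       # the CP1252 byte behind "ú"
i <- encToInt(x, "CP1252")        # 250
y <- intToEnc(i, "CP1252")        # back to the original single byte
as.integer(charToRaw(y))          # 250 again, so the round trip is lossless
```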
Andrie
  • Congratulations. Looks like a good job. Now that I've got used to the simplicity of locale free Windows UTF-16 handling of Unicode I find all the jumping through hoops with locales in R somewhat jading. I guess my comments made that clear!! – David Heffernan Mar 12 '11 at 11:42

I use a variation on Andrie's code:

  • Vectorised on x so that I can apply it to a vector/column of characters
  • Handles characters encoded as two UTF-8 integers (like "\u0098", which gives c(194, 152)) by simply returning the last integer.

This is useful, for example, to map latin1/cp1252 characters to an integer range, which is my application ("more compact file format", they say). It is obviously not appropriate if you need to convert the integers back to characters at some point.

encToInt <- Vectorize(
  function(x, encoding){
    out <- utf8ToInt(iconv(x, encoding, "UTF-8"))
    out[length(out)]
  },
  vectorize.args = "x", USE.NAMES = FALSE, SIMPLIFY = TRUE)
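For example, applied to a small character vector (again assuming iconv on your platform accepts the encoding name "CP1252"):

```r
# Vectorised over x; the encoding argument is passed through unchanged
encToInt(c("a", "\xfa"), encoding = "CP1252")
# [1]  97 250
```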
asachet