
I am processing SPSS data from a questionnaire that must have originated in M$ Word. Word automatically changes hyphens into long hyphens, and these get converted into characters that don't display properly, e.g. "-" turns into "ú".

My question: What is the equivalent to utf8ToInt() in the WINDOWS-1252 character set?

utf8ToInt("A")
[1] 65

When I do this with my own data, I get an error:

x <- str_sub(levels(sd$j1)[1], 7, 7)
print(x)
[1] "ú"

utf8ToInt(x)
Error in utf8ToInt(x) : invalid UTF-8 string

However, the contents of x are perfectly usable in grep and gsub expressions.

> Sys.getlocale()
[1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"
  • Good luck trying to do anything with Unicode text in R for Windows! – David Heffernan Mar 05 '11 at 16:46
  • Your problem is that "ú" is not encoded as UTF-8. It's actually "\xfa" and is encoded in what R calls latin1 but is really 1252, I imagine. Windows has a fundamentally different way of handling Unicode text and I don't believe that R on Windows makes any effort to do it the Windows way. I offer you sympathy but I can't offer more! – David Heffernan Mar 05 '11 at 16:51
  • @David Heffernan, thank you very much for both the insight and the sympathy! Still, is there a way to determine the internal representation of this latin1 character, something like latin1ToInt() – which I haven't yet found? And, vice versa, if I know the latin1 code is 150, how do I generate the actual character, again something like intToLatin1()? – Andrie Mar 07 '11 at 09:21
  • @Andrie I'm very sorry, but I've got nothing really to offer. I find the R documentation less than helpful and I have this strong fear that it has been developed for UNIX based systems which use byte streams and locales and no consideration has been made to the UTF-16 based Unicode of Windows. – David Heffernan Mar 07 '11 at 09:28
  • @David Thank you. My code works for now, and I don't *really* need to know the codes. To reuse this work in future projects, a clumsy workaround would be to create a hash table of non-printing characters and their ASCII equivalents, then save() and load() it for future projects. Many thanks for your help. – Andrie Mar 07 '11 at 09:45
  • @David Heffernan, you might be interested in the solution I concocted and posted as an answer here. – Andrie Mar 12 '11 at 11:39
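Building on the comment above, a minimal sketch that confirms what byte the character actually carries (here "\xfa" stands in for the character extracted from the data):

```r
x <- "\xfa"                  # the single CP1252/latin1 byte that displays as "ú"
as.integer(charToRaw(x))     # 250 -- the raw byte value, no UTF-8 involved
# utf8ToInt() works once the string has been converted to UTF-8 first:
utf8ToInt(iconv(x, "CP1252", "UTF-8"))   # also 250
```

This is why utf8ToInt() fails on the raw string: the byte 0xFA is not valid UTF-8 on its own, but it is a perfectly good CP1252 code.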

3 Answers


If you load the SPSS sav file via read.spss from package foreign, you can easily import the data frame with the correct encoding by specifying it, like:

read.spss("foo.sav", reencode="CP1252")
daroczig
  • I was very excited to read your response and tried this immediately on my data. Sadly, this didn't have the desired effect and I still get characters that don't display properly. I have also looped through the entire iconvlist() but still no luck. But thank you for a very helpful answer. – Andrie Mar 07 '11 at 09:24

After some head-scratching, lots of reading help files and trial and error, I created two little functions that do what I need. These functions work by converting their input into UTF-8 and then returning the integer vector for the UTF-8 encoded character vector, and vice versa.

# Convert character to integer vector
# Optional encoding specifies encoding of x, defaults to current locale
encToInt <- function(x, encoding=localeToCharset()){
    utf8ToInt(iconv(x, encoding, "UTF-8"))
}

# Convert integer vector to character vector
# Optional encoding specifies encoding of x, defaults to current locale
intToEnc <- function(x, encoding=localeToCharset()){
    iconv(intToUtf8(x), "UTF-8", encoding)
}

Some examples:

x <- "\xfa"
encToInt(x)
[1] 250

intToEnc(250)
[1] "ú"
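A round-trip check using these two functions (passing the encoding explicitly, so the result does not depend on the current locale; this assumes iconv on your platform accepts the name "CP1252"):

```r
x <- "\xfa"                       # the CP1252 byte behind "ú"
i <- encToInt(x, "CP1252")        # 250
y <- intToEnc(i, "CP1252")        # back to the original single byte
as.integer(charToRaw(y))          # 250 again, so the round trip is lossless
```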
Andrie
  • Congratulations. Looks like a good job. Now that I've got used to the simplicity of locale free Windows UTF-16 handling of Unicode I find all the jumping through hoops with locales in R somewhat jading. I guess my comments made that clear!! – David Heffernan Mar 12 '11 at 11:42

I use a variation on Andrie's code:

  • Vectorised on x so that I can apply it to a vector/column of characters
  • Handles characters encoded as two UTF-8 integers (like "\u0098", which gives c(194, 152)) by simply returning the last integer.

This is useful, for example, to map latin1/cp1252 characters to an integer range, which is my application ("more compact file format", they say). It is obviously not appropriate if you need to convert the integers back to characters at some point.

encToInt <- Vectorize(
  function(x, encoding){
    out <- utf8ToInt(iconv(x, encoding, "UTF-8"))
    out[length(out)]
  },
  vectorize.args = "x", USE.NAMES = FALSE, SIMPLIFY = TRUE)
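For example, applied to a small character vector (again assuming iconv on your platform accepts the encoding name "CP1252"):

```r
# Vectorised over x; the encoding argument is passed through unchanged
encToInt(c("a", "\xfa"), encoding = "CP1252")
# [1]  97 250
```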
asachet