Encode while preserving undefined characters

Question

I got the string ãƒ\u008F from an external db, and I want to convert it back to the original Unicode character. I know the db is using windows-1252 encoding, so the actual bytes should be \xe3\x83\x8f, which is ハ in UTF-8.

Here is what I've tried so far:

"ãƒ\u008F".encode('windows-1252')
# => Encoding::UndefinedConversionError: U+008F to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252

"ãƒ\u008F".encode('windows-1252', undef: :replace)
# => "\xE3\x83?"

This is reasonable, since 0x8f is undefined in the windows-1252 code page.

----------Windows-1252-----------
  0 1 2 3 4 5 6 7 8 9 a b c d e f
2   ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ 
8 € � ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ � Ž � <---right here!
9 � ‘ ’ “ ” • – — ˜ ™ š › œ � ž Ÿ
a   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬  ® ¯
b ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
c À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
d Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
e à á â ã ä å æ ç è é ê ë ì í î ï
f ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

My question is, how can I encode while preserving the undefined character? Namely, how can I get

s = "ãƒ\u008F".some_magic_methods
# => "\xE3\x83\x8F"

s.force_encoding('utf-8')
# => "ハ"

You're pointing to a ? type character, which usually means invalid. There's [no such character in Windows-1252](https://en.wikipedia.org/wiki/Windows-1252), so encoding it to that is a mistake. Why not use `force_encoding`? — tadman, Mar 22 '16 at 18:31
If the string already has the correct bytes for the UTF-8 representation, then the solution is (as @tadman says), to do `str.force_encoding('utf-8')`. That's all that's necessary. You shouldn't use `encode` if the actual bytes are already correct. — Jordan Running, Mar 22 '16 at 18:39
@Jordan, the problem is that the string's representation is `ãƒ\u008F`, and `"ãƒ\u008F".force_encoding('utf-8')` is still `ãƒ\u008F` — sbs, Mar 22 '16 at 18:45
What are the actual byte values, in decimal or hex, of the string you're getting from the database? — Jordan Running, Mar 22 '16 at 18:54
*the problem is, the string's representation is ãƒ\u008F* -- You say that as if a string has a single representation. I say that the string's representation is `ハ`. Who is right? Your db is using a camera with a Windows-1252 lens to take a picture of the raw bytes. I am using a camera with a UTF-8 lens to take a picture of the raw bytes. The idea is to pick the right lens for your camera. force_encoding() will allow you to pick the right lens — 7stud, Mar 23 '16 at 04:25
@Jordan, the actual value I'm getting from the db is `ãƒ\u008F` — sbs, Mar 23 '16 at 04:38
What are *the actual byte values, in decimal or hex*, of the string you're getting from the database? — Jordan Running, Mar 23 '16 at 04:40
@7stud Taking your example of lens, if the raw bytes are `\xe3\x83\x8f`, then in UTF-8 lens, it will show as `ハ`; while in Windows-1252, it will show as `ãƒ�`, since `\x8f` is not defined in Windows-1252. The problem is, for some reason, the db stores `ãƒ\u008F`, which is kind of mix in between. — sbs, Mar 23 '16 at 04:43
@Jordan The actual byte values in decimal should be `\xc3\xa3\xc6\x92\xc2`, and viewing this value in UTF-8 gives `ãƒ\u008F` — sbs, Mar 23 '16 at 04:46
*the actually value I'm getting from db is ãƒ\u008F* -- What is the length of your db string, and by that I mean what does *ruby* say is the length of your db string? — 7stud, Mar 23 '16 at 05:00
@7stud Sorry it's the hex value.. `"ãƒ\u008F".size # => 3`, `"ãƒ\u008F".bytesize # => 6` — sbs, Mar 23 '16 at 05:48
Can you explain how you get a byte size of 6, yet you only posted 5 hex escapes in your previous comment? Can you post the output of: `str_from_db.each_byte {|byte| p byte}` — 7stud, Mar 23 '16 at 06:16
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107094/discussion-between-sbs-and-7stud). — sbs, Mar 23 '16 at 06:43

score 1 · Accepted Answer · answered Mar 22 '16 at 22:34

I think I have a vague idea of what's going on here, but I'm having trouble formulating a proper explanation. Nevertheless, here's a solution that at least works for your one example:

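str = "ãƒ\u008F"
# Map each character back to its Windows-1252 byte. Characters with no
# Windows-1252 mapping (such as U+008F) raise, so fall back to their own
# code point, which here equals the original byte.
str2 = str.chars.map {|c| c.encode('windows-1252').ord rescue c.ord }
         .pack('C*').force_encoding('utf-8')
puts str2
# => ハ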

Of course, this is pretty inefficient for large texts, but hopefully it'll help. If I have the wherewithal later on I'll come back and try to add a better explanation.
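
As a rough illustration of why the per-character trick works, here is a sketch of how the string probably got garbled in the first place. It assumes (this is a guess about the db, not something confirmed in the thread) that the UTF-8 bytes of ハ were decoded as Windows-1252, with the unassigned byte 0x8F passed through as the control character U+008F. The helper name `lenient_1252_char` is made up for this example.

```ruby
# Hypothetical helper: decode one byte as Windows-1252, but pass bytes that
# have no character assigned through as the code point with the same number.
# (Assumption: this mimics what happened to the data on its way into the db.)
def lenient_1252_char(byte)
  [byte].pack('C').force_encoding('Windows-1252').encode('UTF-8')
rescue Encoding::UndefinedConversionError
  byte.chr(Encoding::UTF_8)
end

original = "\u30CF"        # the character ハ
raw      = original.bytes  # its UTF-8 bytes: E3 83 8F
garbled  = raw.map { |b| lenient_1252_char(b) }.join

p garbled == "\u00E3\u0192\u008F"  # => true, i.e. the string from the question
p garbled.size                     # => 3
p garbled.bytesize                 # => 6
puts garbled.bytes.map { |b| format('%02X', b) }.join(' ')
# => C3 A3 C6 92 C2 8F
```

If that assumption holds, the code in this answer is simply running the same mapping in reverse, one character at a time.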

I get what you're doing here: take the ord from `windows-1252` and rescue with the character's own ord. I hope there's a better way to do that. — sbs, Mar 23 '16 at 06:03
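
One possibly tidier variant, offered as a sketch rather than a tested-in-production fix: `String#encode` accepts a `fallback:` option that supplies a replacement for characters with no mapping in the target encoding. The table below covers only the five bytes that Windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D). The names `UNDEFINED_1252_BYTES`, `FALLBACK` and `undo_mojibake` are invented for this example.

```ruby
# The five byte values with no character assigned in Windows-1252.
UNDEFINED_1252_BYTES = [0x81, 0x8D, 0x8F, 0x90, 0x9D].freeze

# Map each matching C1 control character (e.g. U+008F) to the raw byte with
# the same value, tagged as Windows-1252 so encode inserts it unchanged.
FALLBACK = UNDEFINED_1252_BYTES.each_with_object({}) do |byte, table|
  table[byte.chr(Encoding::UTF_8)] = [byte].pack('C').force_encoding('Windows-1252')
end.freeze

# Hypothetical helper: recover the original bytes, then reinterpret as UTF-8.
def undo_mojibake(str)
  str.encode('Windows-1252', fallback: FALLBACK).force_encoding('UTF-8')
end

puts undo_mojibake("\u00E3\u0192\u008F")
# => ハ
```

This converts the whole string in one `encode` call instead of rescuing per character. Characters outside the table that also have no Windows-1252 mapping should still raise, as far as I know, which is probably what you want for data that was not garbled this way.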
