How does Ruby represent strings internally?

Question

I ran into some trouble while creating a C extension for Ruby, and it got me thinking: how does Ruby (1.9.1) handle strings (and all the encoding stuff) internally?

If I have a string like "o" and I pass it to a C function (as a VALUE), I can deal with it pretty easily using the RSTRING_PTR() and RSTRING_LEN() macros. However, if I make the string "ö" (a German umlaut character), RSTRING_LEN() gives me 2.

I'm a bit stumped by the contents of RSTRING_PTR() in that case: the two bytes are 0xA4 and 0xC3. What encoding is this? I tried using "ö".force_encoding( ... ) with different encodings before passing the string to the C function, but that does not affect the contents of RSTRING_PTR() at all.

What I need is a way to have the string represented as a WCHAR* encoded in UTF-16 (in the case of "ö", that would be 0x00F6) in my C-function, but that's kinda hard to do if you do not know what encoding you're coming from...

Thanks in advance for any help.

`force_encoding` isn't supposed to change the contents of the string; it just changes how the bytes are interpreted. – Cubic, Sep 17 '12 at 12:38
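
To illustrate the distinction this comment draws, here is a minimal Ruby sketch comparing `force_encoding`, which only relabels the bytes, with `encode`, which transcodes them. The variable names are made up for the demo, and ISO-8859-1 is just an arbitrary second encoding chosen as an assumption; the string is written as a `\u` escape so the result does not depend on the source file's encoding.

```ruby
# "ö" written as an escape, so this literal is always UTF-8.
s = "\u00F6"
utf8_bytes = s.bytes                      # the two UTF-8 bytes 0xC3 0xB6

# force_encoding only changes the label attached to the bytes.
relabeled = s.dup.force_encoding("ISO-8859-1")
relabeled_bytes = relabeled.bytes         # still 0xC3 0xB6
relabeled_length = relabeled.length       # 2, read as two Latin-1 characters

# encode actually converts the bytes to the target encoding.
transcoded = s.encode("ISO-8859-1")
transcoded_bytes = transcoded.bytes       # the single Latin-1 byte 0xF6

puts utf8_bytes.map { |b| format("%02x", b) }.join(" ")
puts relabeled_bytes.map { |b| format("%02x", b) }.join(" ")
puts transcoded_bytes.map { |b| format("%02x", b) }.join(" ")
```

This would explain why trying different arguments to `force_encoding` left the contents of RSTRING_PTR() unchanged on the C side.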

score 2 · Accepted Answer · answered Jun 27 '12 at 11:54

String internals in Ruby 1.9 depend on the __ENCODING__ constant and the Encoding.default_internal setting.
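
As a rough way to check these settings on a given installation, the following sketch prints them next to the encoding of a string literal. The variable names are illustrative, and the values printed will vary with the Ruby version, the source file's magic comment, and the environment, so no particular output is assumed here.

```ruby
literal = "o"

source_encoding  = __ENCODING__               # encoding of the current source file
literal_encoding = literal.encoding           # encoding attached to this string object
default_internal = Encoding.default_internal  # nil unless it has been set

puts "source file:      #{source_encoding}"
puts "string literal:   #{literal_encoding}"
puts "default_internal: #{default_internal.inspect}"
```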

In your case it looks like UTF-8 (the default), but ö is actually c3 b6 in UTF-8, and c3 a4 is ä.
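
The byte values above can be checked from Ruby, and the same session can sketch the conversion to UTF-16 that the question asks for. This is only an illustration under some assumptions: the `hex` helper is made up for the demo, and UTF-16LE is chosen on the guess that the WCHAR* is consumed on a little-endian platform such as Windows.

```ruby
o_umlaut = "\u00F6"   # ö
a_umlaut = "\u00E4"   # ä

# Hypothetical helper: render a string's raw bytes as hex.
hex = ->(str) { str.bytes.map { |b| format("%02x", b) }.join(" ") }

o_hex = hex.call(o_umlaut)   # "c3 b6"
a_hex = hex.call(a_umlaut)   # "c3 a4"

# Transcode to UTF-16LE and read the result back as 16-bit code units.
utf16 = o_umlaut.encode("UTF-16LE")
code_units = utf16.unpack("v*")   # [0x00F6]

puts o_hex
puts a_hex
puts code_units.map { |u| format("0x%04X", u) }.join(" ")
```

Doing the `encode` call on the Ruby side before handing the string to the extension means the C function can read the buffer as UTF-16 code units directly; a terminating wide NUL still has to be handled by the caller.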

zed_0xff


Oh yeah, you're right, I mixed up my test cases. Thanks for the help, conversion works now =) – DeX3 Jun 27 '12 at 12:16
