How to convert Unicode characters to escape codes

Question

I have a bunch of strings like this: {\b\cf12 よろてそ }. I'm thinking I could iterate over each character and replace any Unicode character (Edit: anything where AscW(char) > 127 or < 0) with a Unicode escape code (\u###). However, I'm not sure how to do so programmatically. Any suggestions?

Clarification:

I have a string like {\b\cf12 よろてそ } and I want a string like {\b\cf12 [STUFF]}, where [STUFF] will display as よろてそ when I view the rtf text.

In VB6 all strings are Unicode. Can you therefore clarify: do you believe you are accidentally reading something that is UTF-8 as if it were Unicode or an OEM code page? – AnthonyWJones, Jun 18 '09 at 21:55
Also why do you want this? What are you going to do with strings with these escape codes in? – AnthonyWJones, Jun 18 '09 at 21:56
@Anthony: I want this because I have some dynamically generated strings that mix RTF and Unicode together, which cannot be displayed properly since RTF is an 8-bit format. – Brian, Jun 18 '09 at 22:01
As an aside, some of these strings are actually statically generated strings mixing unescaped Unicode and RTF together. – Brian, Jun 18 '09 at 22:02

AnthonyWJones · Accepted Answer · 2009-06-19T16:07:00.683

score 3

You can simply use the AscW() function to get the correct value:

sRTF = "\u" & CStr(AscW(char))

Note that, unlike other Unicode escapes, RTF writes the character as a decimal signed 16-bit integer (a 2-byte short int). AscW() returns exactly that type in VB6, which makes the conversion quite easy.

Edit

As MarkJ points out in a comment, you would only do this for characters outside the range 0-127. Some characters inside the 0-127 range need special handling as well, for example the backslash and braces that RTF itself reserves, if they are meant as literal text.
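Putting the answer and the comments together, the loop the question describes might look like the sketch below. The function name RtfEscapeUnicode is made up for illustration. It assumes the input already contains valid RTF markup, so ASCII characters (including backslashes and braces) are passed through untouched. RTF treats the character after \uN as a fallback for readers without Unicode support (one character under the default \uc1), so the sketch appends "?" for that purpose; without it the reader would consume the next real character instead.

```vb
' Hypothetical helper: replaces every UTF-16 code unit outside 0-127
' with an RTF \uN escape followed by "?" as the fallback character.
' ASCII, including any existing RTF markup, is copied unchanged.
Public Function RtfEscapeUnicode(ByVal sText As String) As String
    Dim i As Long
    Dim sChar As String
    Dim nCode As Integer
    Dim sOut As String

    For i = 1 To Len(sText)
        sChar = Mid$(sText, i, 1)
        nCode = AscW(sChar)              ' signed 16-bit: -32768..32767
        If nCode < 0 Or nCode > 127 Then
            ' RTF wants the signed decimal value, which AscW already gives
            sOut = sOut & "\u" & CStr(nCode) & "?"
        Else
            sOut = sOut & sChar
        End If
    Next i

    RtfEscapeUnicode = sOut
End Function
```

For the string in the question, built with ChrW$ so the source stays ASCII:

```vb
Debug.Print RtfEscapeUnicode("{\b\cf12 " & ChrW$(&H3088) & ChrW$(&H308D) & _
                             ChrW$(&H3066) & ChrW$(&H305D) & " }")
' {\b\cf12 \u12424?\u12429?\u12390?\u12381? }
```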

edited Jun 19 '09 at 16:07

answered Jun 19 '09 at 09:15


You could do this for all char values above 127. Chars of 127 and below are the same in all code pages and can probably be left alone – MarkJ Jun 19 '09 at 13:23
@MarkJ: Agreed, I should probably have pointed that out, the question uses 256 which is wrong. – AnthonyWJones Jun 19 '09 at 13:31
Numbers below 0 also need to be converted. – Brian Jun 19 '09 at 14:57
@Brian: yep that too, adjust answer yet again :) – AnthonyWJones Jun 19 '09 at 16:08
Unicode code points are never negative. If you are treating them as integers, then they will appear to be < 0 because VB6 doesn't have a 16-bit unsigned data type. Also, be sure to account for surrogate pairs – rpetrich Jun 27 '09 at 13:59
@rpetrich: we understand that the code points do not have a sign, which is why I missed it in an edit of my answer. Do you think that surrogate pairs really need any special handling in this case? Would they not be encoded into an RTF as surrogate pairs anyway? – AnthonyWJones Jun 27 '09 at 19:32
I can't be certain as I'm not an RTF guru, but it seems like the standard way is to encode surrogate pairs as two separate characters. Example: U+1D44E would become \u-10187?\u-9138? (with ? as the fallback character for both code units) – rpetrich Jun 28 '09 at 21:35
@rpetrich: VB6 would have no way to return the value &H1D44E since it's beyond the signed integer range. Internally VB6 uses 2-byte Unicode characters, so it would use the surrogate pair approach to encode these upper characters. ChrW/AscW treat each value in a surrogate pair as an independent character. I would also expect the RTF reader to at least understand surrogate pairs, hence I don't think any extra action is needed by this code. – AnthonyWJones Jun 29 '09 at 08:36
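The sign issue and the surrogate-pair point from these comments can be checked with a small sketch. The routine name ShowSurrogateEscapes is hypothetical. It builds U+1D44E from its two UTF-16 code units (D835 and DC4E), prints the signed value AscW returns for each, and masks with &HFFFF& to recover the unsigned code unit as a Long.

```vb
' Hypothetical demo routine; call it from the Immediate window.
Public Sub ShowSurrogateEscapes()
    Dim sPair As String
    Dim i As Long
    Dim nSigned As Integer
    Dim lUnsigned As Long

    ' U+1D44E stored as the UTF-16 surrogate pair D835 DC4E
    sPair = ChrW$(&HD835) & ChrW$(&HDC4E)

    For i = 1 To Len(sPair)
        ' AscW sees each half of the pair as its own character
        nSigned = AscW(Mid$(sPair, i, 1))
        ' Masking with a Long constant yields the 0..65535 value
        lUnsigned = nSigned And &HFFFF&
        Debug.Print "\u" & CStr(nSigned) & "?", Hex$(lUnsigned)
    Next i
End Sub
```

The two iterations produce \u-10187? for D835 and \u-9138? for DC4E, so a per-character loop over a VB6 string emits one \uN escape per surrogate without any extra code.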

score 0 · Answer 2 · answered Jun 30 '09 at 15:14

Another, more roundabout way would be to add MSScript.OCX to the project and call VBScript's Escape function through it. For example:

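Sub Main()
    Dim s As String
    ' "よろてそ" built from code points so the source file stays ASCII
    s = ChrW$(&H3088) & ChrW$(&H308D) & ChrW$(&H3066) & ChrW$(&H305D)
    Debug.Print MyEscape(s)
End Sub

' Evaluates VBScript's Escape() on s through a late-bound Script Control.
Function MyEscape(ByVal s As String) As String
    Dim scr As Object
    Set scr = CreateObject("MSScriptControl.ScriptControl")
    scr.Language = "VBScript"
    scr.Reset
    MyEscape = scr.Eval("Escape(" & dq(s) & ")")
End Function

' Wraps s in double quotes, doubling any embedded quotes so the
' generated VBScript string literal stays valid.
Function dq(ByVal s As String) As String
    dq = Chr$(34) & Replace(s, Chr$(34), Chr$(34) & Chr$(34)) & Chr$(34)
End Function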

The Main routine passes in the original Japanese characters and the debug output says:

%u3088%u308D%u3066%u305D

HTH

You should be aware that MS Script Control is not supported on Vista. — MarkJ, Jul 02 '09 at 06:04
By "not supported" does that mean, "doesn't work" or "if it breaks, or breaks the O/S, no one's going to help me"? — bugmagnet, Jul 02 '09 at 12:07
