Short version
- Given:
1/16/2006 2∶30∶11 ᴘᴍ
- How to get:
1/16/2006 2:30:11 PM
- rather than:
?1/?16/?2006 ??2:30:11 ??
Background
I have an example Unicode (UTF-16) encoded string:
U+200e U+0031 U+002f U+200e U+0031 U+0036 U+002f U+200e U+0032 U+0030 U+0030 U+0036 U+0020 U+200f U+200e U+0032 U+2236 U+0033 U+0030 U+2236 U+0031 U+0031 U+0020 U+1d18 U+1d0d
[LTR] 1 / [LTR] 1 6 / [LTR] 2 0 0 6 [RTL] [LTR] 2 ∶ 3 0 ∶ 1 1 ᴘ ᴍ
In a slightly easier-to-read form:
[LTR]1/[LTR]16/[LTR]2006 [RTL][LTR]2∶30∶11 ᴘᴍ
The actual final text, as you're supposed to see it, is:
1/16/2006 2∶30∶11 ᴘᴍ
I currently use the Windows function WideCharToMultiByte to convert the UTF-16 to the local code-page:
WideCharToMultiByte(CP_ACP, 0, text, length, NULL, 0, NULL, NULL);
and when I do, the text comes out as:
?1/?16/?2006 ??2:30:11 ??
I don't control the presence of the Unicode text direction markers; it's a security thing. But obviously when I'm converting the Unicode to (for example) ISO-8859-1, those characters are irrelevant, make no sense, and I would hope they can be dropped.
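For illustration, here is a minimal sketch of stripping the markers by hand before the conversion, assuming the text sits in a writable UTF-16 buffer. The names `is_bidi_control` and `strip_bidi_controls` are hypothetical, `uint16_t` stands in for `WCHAR` to keep the sketch portable, and the list of code points treated as "direction markers" is my assumption about what needs dropping:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero for the Unicode directional formatting characters:
 * U+061C (ALM), U+200E/U+200F (LRM/RLM),
 * U+202A..U+202E (embeddings and overrides),
 * U+2066..U+2069 (isolates).
 * All of these are in the BMP, so a single UTF-16 code unit is enough. */
static int is_bidi_control(uint16_t cu)
{
    return cu == 0x061C || cu == 0x200E || cu == 0x200F ||
           (cu >= 0x202A && cu <= 0x202E) ||
           (cu >= 0x2066 && cu <= 0x2069);
}

/* Hypothetical helper: removes the characters above from a UTF-16 buffer
 * in place and returns the new length in code units. */
static size_t strip_bidi_controls(uint16_t *text, size_t length)
{
    size_t out = 0;
    assert(text != NULL || length == 0);
    for (size_t i = 0; i < length; i++) {
        if (!is_bidi_control(text[i]))
            text[out++] = text[i];
    }
    return out;
}
```

The shortened buffer and the returned length would then be passed to WideCharToMultiByte in place of the original `text` and `length`.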
Is there a Windows function (e.g. FoldString, WideCharToMultiByte) that can be instructed to drop these non-mappable, non-printable characters?
1/16/2006 2∶30∶11 ᴘᴍ
That gets us close
If a function did that, dropped the non-printing characters that don't have a representation in the target code-page, we would get:
1/16/2006 2∶30∶11 ᴘᴍ
When converted to ISO-8859-1, it becomes:
1/16/2006 2?30?11 ??
That's because some of those characters don't map exactly into ISO-8859-1:
1/16/2006 2[U+2236]30[U+2236]11 [U+1D18][U+1D0D]
1/16/2006 2[RATIO]30[RATIO]11 [Small Capital P][Small Capital M]
But when you see them, it doesn't seem unreasonable that they could be best-fit mapped into:
- Original:
1/16/2006 2∶30∶11 ᴘᴍ
- Mapped:
1/16/2006 2:30:11 PM
Is there a function that can do that?
I'm happy to suffer with:
- 1/16/2006 2?30?11 ??
But I really need to fix:
- ?1/?16/?2006 ??2:30:11 ??
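The kind of best-fit mapping described above could be hand-rolled as a small lookup table. This is only a sketch under my own assumptions: the table entries are hand-picked for this one string and are not taken from any official best-fit table, and `to_latin1_best_fit` is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, hand-picked fallbacks for characters outside ISO-8859-1. */
struct fallback { uint16_t from; unsigned char to; };

static const struct fallback FALLBACKS[] = {
    { 0x2236, ':' },  /* RATIO -> colon */
    { 0x1D18, 'P' },  /* LATIN LETTER SMALL CAPITAL P -> P */
    { 0x1D0D, 'M' },  /* LATIN LETTER SMALL CAPITAL M -> M */
};

/* Converts UTF-16 code units to ISO-8859-1 bytes. Code units up to U+00FF
 * are copied as-is, table hits use their fallback, and everything else
 * (including each half of a surrogate pair) becomes '?'.
 * The output is NUL-terminated; returns the number of bytes written,
 * not counting the terminator. */
static size_t to_latin1_best_fit(const uint16_t *text, size_t length,
                                 unsigned char *out, size_t out_size)
{
    size_t n = 0;
    assert(text != NULL && out != NULL);
    for (size_t i = 0; i < length && n + 1 < out_size; i++) {
        uint16_t cu = text[i];
        unsigned char c = '?';
        if (cu <= 0xFF) {
            c = (unsigned char)cu;
        } else {
            for (size_t k = 0; k < sizeof FALLBACKS / sizeof FALLBACKS[0]; k++) {
                if (FALLBACKS[k].from == cu) {
                    c = FALLBACKS[k].to;
                    break;
                }
            }
        }
        out[n++] = c;
    }
    if (out_size > 0)
        out[n] = '\0';
    return n;
}
```

A table like this obviously does not scale, which is why a ready-made mapping would be preferable.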
Unicode has the notion
Unicode already has the notion of which "fancy" characters can be replaced with which "normal" characters.
- U+00BA º → o (Masculine ordinal indicator → small Latin letter o, superscripted)
- U+FF0F ／ → / (Fullwidth solidus → solidus, wide)
- U+00BC ¼ → 1/4 (Vulgar fraction one quarter)
- U+2033 ″ → ′′ (Double prime)
- U+FE64 ﹤ → < (Small less-than sign)
I know these are technically for a different purpose. But there is also the general notion of a mapping list (which again is for a different purpose).
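To make the idea concrete, here is a rough sketch of applying the replacements listed above, where one character may expand into several. The table is hand-copied from the Unicode compatibility decompositions of just those five characters (note that ¼ decomposes with U+2044 FRACTION SLASH rather than the ASCII solidus); a real implementation would presumably take its data from the Unicode Character Database. The names `compat_entry` and `expand_compat` are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One compatibility replacement: a BMP character and up to three
 * UTF-16 code units that stand in for it. */
struct compat_entry { uint16_t from; uint16_t to[3]; size_t to_len; };

static const struct compat_entry COMPAT[] = {
    { 0x00BA, { 0x006F },                 1 },  /* º -> o */
    { 0xFF0F, { 0x002F },                 1 },  /* fullwidth solidus -> / */
    { 0x00BC, { 0x0031, 0x2044, 0x0034 }, 3 },  /* ¼ -> 1, fraction slash, 4 */
    { 0x2033, { 0x2032, 0x2032 },         2 },  /* double prime -> two primes */
    { 0xFE64, { 0x003C },                 1 },  /* small less-than -> < */
};

/* Copies src to dst, replacing table hits with their expansion.
 * Stops early rather than overflowing dst; returns code units written. */
static size_t expand_compat(const uint16_t *src, size_t src_len,
                            uint16_t *dst, size_t dst_cap)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < src_len; i++) {
        const uint16_t *rep = &src[i];  /* default: copy unchanged */
        size_t rep_len = 1;
        for (size_t k = 0; k < sizeof COMPAT / sizeof COMPAT[0]; k++) {
            if (COMPAT[k].from == src[i]) {
                rep = COMPAT[k].to;
                rep_len = COMPAT[k].to_len;
                break;
            }
        }
        if (n + rep_len > dst_cap)
            break;
        for (size_t j = 0; j < rep_len; j++)
            dst[n++] = rep[j];
    }
    return n;
}
```

Some replacements, such as the prime and the fraction slash, still fall outside ISO-8859-1, so this alone would not be a complete best-fit.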
Microsoft SQL Server, when asked to insert a Unicode string into a non-Unicode varchar column, does an even better job:
Is there a mapping list for the purpose of Unicode best-fit?
Because the reality is that it just makes a mess for users: