
Short version

  • Given: 1/16/2006 2∶30∶11 ᴘᴍ
  • How to get: 1/16/2006 2:30:11 PM
  • Rather than: ?1/?16/?2006 ??2:30:11 ??

Background

I have an example Unicode (UTF-16) encoded string:

U+200e U+0031 U+002f U+200e U+0031 U+0036 U+002f U+200e U+0032 U+0030 U+0030 U+0036 U+0020 U+200f U+200e U+0032 U+2236 U+0033 U+0030 U+2236 U+0031 U+0031 U+0020 U+1d18 U+1d0d
 [LTR]      1      /  [LTR]      1      6      /  [LTR]      2      0      0      6         [RTL]  [LTR]      2      ∶      3      0      ∶       1      1             ᴘ      ᴍ

In a slightly easier-to-read form, that is:

[LTR]1/[LTR]16/[LTR]2006 [RTL][LTR]2∶30∶11 ᴘᴍ

The actual final text as you're supposed to see it is:

[image]

I currently use the Windows function WideCharToMultiByte to convert the UTF-16 to the local code-page:

WideCharToMultiByte(CP_ACP, 0, text, length, NULL, 0, NULL, NULL);

and when I do, the text comes out as:

?1/?16/?2006 ??2:30:11 ??

I don't control the presence of the Unicode text direction markers; it's a security thing. But obviously, when I'm converting the Unicode to (for example) ISO-8859-1, those characters are irrelevant and make no sense, and I would hope they can be dropped.

Is there a Windows function (e.g. FoldString, WideCharToMultiByte) that can be instructed to drop these non-mappable, non-printable characters?

1/16/2006 2∶30∶11 ᴘᴍ
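A minimal sketch of that dropping step in portable C, assuming the text is held as an array of UTF-16 code units. `is_bidi_control` and `strip_bidi_controls` are hypothetical helper names, not Windows functions; on Windows the buffer would be `WCHAR` rather than `uint16_t`. The characters removed are the ones carrying Unicode's Bidi_Control property, all of which fit in a single UTF-16 code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero for Unicode bidirectional formatting characters:
   U+061C (ALM), U+200E/U+200F (LRM, RLM),
   U+202A..U+202E (embeddings and overrides),
   U+2066..U+2069 (isolates). */
static int is_bidi_control(uint16_t ch)
{
    return ch == 0x061C ||
           ch == 0x200E || ch == 0x200F ||
           (ch >= 0x202A && ch <= 0x202E) ||
           (ch >= 0x2066 && ch <= 0x2069);
}

/* Hypothetical helper: removes bidi controls from text[0..length) in place
   and returns the new length. No terminator is written. */
static size_t strip_bidi_controls(uint16_t *text, size_t length)
{
    size_t out = 0;
    for (size_t i = 0; i < length; i++) {
        if (!is_bidi_control(text[i]))
            text[out++] = text[i];
    }
    return out;
}
```

Run over the 25 code units of the example string, this leaves the 20 code units of 1/16/2006 2∶30∶11 ᴘᴍ, which could then be handed to WideCharToMultiByte with the shortened length.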

That gets us close

If a function did that, dropped the non-printing characters that don't have a representation in the target code-page, we would get:

1/16/2006 2∶30∶11 ᴘᴍ

When converted to ISO-8859-1, it becomes:

1/16/2006 2?30?11 ??

That's because some of those characters don't map exactly into ISO-8859-1:

1/16/2006 2[U+2236]30[U+2236]11 [U+1d18][U+1d0d]

1/16/2006 2[RATIO]30[RATIO]11 [Small Capital P][Small Capital M]

But when you see them, it doesn't seem unreasonable that they could be best-fit mapped into:

  • Original: 1/16/2006 2∶30∶11 ᴘᴍ
  • Mapped: 1/16/2006 2:30:11 PM

Is there a function that can do that?

I'm happy to suffer with:

  • 1/16/2006 2?30?11 ??

But I really need to fix:

  • ?1/?16/?2006 ??2:30:11 ??
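One way to get the best-fit result is a small hand-written table applied before the conversion. The sketch below assumes UTF-16 code units and covers only the three characters from this example; `BEST_FIT` and `apply_best_fit` are hypothetical names, and the table is an illustrative choice of stand-ins, not a Unicode or Windows data file.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hand-written best-fit table (illustrative only). */
struct best_fit {
    uint16_t from;  /* UTF-16 code unit to replace */
    char     to;    /* ASCII stand-in */
};

static const struct best_fit BEST_FIT[] = {
    { 0x2236, ':' },  /* RATIO */
    { 0x1D18, 'P' },  /* LATIN LETTER SMALL CAPITAL P */
    { 0x1D0D, 'M' },  /* LATIN LETTER SMALL CAPITAL M */
};

/* Replaces every code unit listed in BEST_FIT with its ASCII stand-in,
   in place. Other code units are left unchanged.
   Returns the number of replacements made. */
static size_t apply_best_fit(uint16_t *text, size_t length)
{
    size_t replaced = 0;
    for (size_t i = 0; i < length; i++) {
        for (size_t j = 0; j < sizeof BEST_FIT / sizeof BEST_FIT[0]; j++) {
            if (text[i] == BEST_FIT[j].from) {
                text[i] = (uint16_t)(unsigned char)BEST_FIT[j].to;
                replaced++;
                break;
            }
        }
    }
    return replaced;
}
```

A linear scan is enough for a table this small; a larger table would probably want to be sorted and searched with `bsearch`. Anything not in the table still falls through to whatever the code-page conversion does with it.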

Unicode has the notion

Unicode already has a notion of which "fancy" characters can be replaced with which "normal" characters.

I know these are technically for a different purpose. But there is also the general notion of a mapping list (which again is for a different purpose).

Microsoft SQL Server, when asked to insert a Unicode string into a non-Unicode varchar column, does an even better job:

[image]

Is there a mapping list for the purpose of Unicode best-fit?

Because the reality is that it just makes a mess for users:

[image]

Ian Boyd
  • There is no Win32 API function to do what you are asking for. You need to manually strip out unwanted chars from the string, and replace unmapped chars as needed, before then passing the string to `WideCharToMultiByte()`. You are dealing with a very specific use case, so it should be fairly straightforward to do the modifications manually. – Remy Lebeau Apr 03 '19 at 18:08
