I'm trying to write a parser for "text" files which I know will be encoded in one of the Windows single byte code pages. These files contain text representations of basic data types, and the spec I have for these representations is lacking, to say the least.

I noticed in Windows-874 ten little inconspicuous characters near the end called THAI DIGIT ZERO to THAI DIGIT NINE.

I'm trying to write this parser to be pretty robust but I'm working a bit in the dark as there are many different programs which can generate these data files and I don't have access to the sources.

What I want to know is: do any functions in Microsoft C++ libraries convert real number data types into a std::string or char const * (i.e. serialization) which would contain non-arabic-numerals?

I don't use Microsoft C++ libraries so can't reference any in particular but a made-up example could be char const * IntegerFunctions::ToString(int i).

Samuel Harmer
  • Windows internally uses Unicode (UTF-16); pretty much all SBCS and MBCS support is implemented by converting to and from Unicode. (The exception being very trivial functions like `strcpy`). – MSalters Jan 20 '12 at 14:30
  • It does now but hasn't always ;) my data files can come from any time period in the last ~15 years. – Samuel Harmer Jan 20 '12 at 14:51

2 Answers

This is sort of the inverse answer, but this page seems to indicate that Microsoft's runtime libraries understand quite a few (but not all) non-Latin numerals when doing what you want to do, i.e. parsing a string into a number.

Thai is included, which seems to indicate that it's a good idea to support it in custom code, too.

To include more information here, the linked-to page states that Microsoft's msvcr100 runtime supports decoding numerals from the following character sets:

  • ASCII
  • Arabic-Indic
  • Extended Arabic
  • Devanagari
  • Bengali
  • Gurmukhi
  • Gujarati
  • Oriya
  • Telugu
  • Kannada
  • Malayalam
  • Thai
  • Lao
  • Tibetan
  • Myanmar
  • Khmer
  • Mongolian
  • Full Width

The full page includes more programming environments and more languages (there are plenty of negatives, too).
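
As a rough illustration of supporting Thai digits in custom parsing code, here is a minimal sketch. It assumes the input is known to be Windows-874, where THAI DIGIT ZERO through THAI DIGIT NINE occupy the contiguous bytes 0xF0 to 0xF9, and it assumes a C++17 compiler. The names `normalize_cp874_digits` and `parse_cp874_integer` are hypothetical helpers made up for this example, not part of any Microsoft library.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: replace Windows-874 Thai digit bytes with ASCII digits.
// Assumption: the bytes really are Windows-874; in other code pages the
// range 0xF0..0xF9 means something else entirely.
std::string normalize_cp874_digits(std::string_view bytes) {
    std::string out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        const auto b = static_cast<unsigned char>(c);
        if (b >= 0xF0 && b <= 0xF9) {
            // 0xF0 is THAI DIGIT ZERO, 0xF9 is THAI DIGIT NINE.
            out.push_back(static_cast<char>('0' + (b - 0xF0)));
        } else {
            out.push_back(c);  // every other byte is passed through unchanged
        }
    }
    return out;
}

// Hypothetical helper: parse a decimal integer written with ASCII digits,
// Thai digits, or a mix of both. Returns nullopt on any trailing garbage,
// empty input, or overflow.
std::optional<long> parse_cp874_integer(std::string_view bytes) {
    const std::string ascii = normalize_cp874_digits(bytes);
    long value = 0;
    const char* first = ascii.data();
    const char* last = ascii.data() + ascii.size();
    const auto result = std::from_chars(first, last, value);
    if (result.ec != std::errc{} || result.ptr != last) {
        return std::nullopt;
    }
    return value;
}
```

Normalizing first and parsing second keeps the numeric parser itself ASCII-only, so the same approach could be extended to other code pages by swapping the normalization step.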

unwind
  • Sorry, upon re-reading your answer, it appears you've answered the inverse routine of what I'm asking. I've updated the question to clarify. – Samuel Harmer Jan 20 '12 at 14:44

These digits certainly could be created by Microsoft libraries. The properties LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS determine whether numbers formatted by the OS will use native (i.e. non-ASCII) digits. Those strings are initially Unicode, because that's how Windows creates strings internally. When you have a Thai locale and you convert the Unicode string to CP874, those characters will be kept.

A simple function that demonstrates this behavior is GetNumberFormatA.

MSalters
  • Would I be right in thinking that `GetNumberFormat` would be more for translation (i.e. arabic-numeral `string` to non-arabic-numeral `string`)? And it's therefore reasonable to assume no one would do this for serialization of a data type in a file which isn't meant to be hand modified? I'm essentially trying to predict what kind of number representations I am likely to encounter. – Samuel Harmer Jan 20 '12 at 14:49
  • It works both ways; i.e. you can use it to check how `"42.7"` would look in Thai digits. Any other Windows function that formats 42.7 into a string for human consumption (i.e. using the user's locale) will also apply LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS. – MSalters Jan 20 '12 at 15:05
  • And what options do you have for acknowledging/ignoring locale: none, set-once, or per-call? – Samuel Harmer Jan 20 '12 at 15:49
  • Depends on the function. There's always a user locale, but some functions ignore it (especially in the kernel - no user context), some use the default, and some allow you to pass it. – MSalters Jan 20 '12 at 15:59