Are non-latin numerals in Windows SBCS codepages used by any Microsoft libraries to represent numerical data in C strings?

Question

I'm trying to write a parser for "text" files which I know will be encoded in one of the Windows single byte code pages. These files contain text representations of basic data types, and the spec I have for these representations is lacking, to say the least.

I noticed in Windows-874 ten little inconspicuous characters near the end called THAI DIGIT ZERO to THAI DIGIT NINE.

I'm trying to write this parser to be pretty robust but I'm working a bit in the dark as there are many different programs which can generate these data files and I don't have access to the sources.

What I want to know is: do any functions in Microsoft C++ libraries convert numeric data types into a std::string or char const * (i.e. serialization) in a way that could produce non-Arabic numerals?

I don't use Microsoft C++ libraries so can't reference any in particular but a made-up example could be char const * IntegerFunctions::ToString(int i).

Windows internally uses Unicode (UTF-16); pretty much all SBCS and MBCS support is implemented by converting to and from Unicode. (The exception being very trivial functions like `strcpy`). — MSalters, Jan 20 '12 at 14:30
It does now but hasn't always ;) my data files can come from any time period in the last ~15 years. — Samuel Harmer, Jan 20 '12 at 14:51

unwind · Answer 1 · 2012-01-20T13:23:31.670

Sort of the inverse answer, but this page seems to indicate that Microsoft's runtime libraries at least understand quite a few (but not all) non-Latin numerals when doing the reverse of what you want to do, i.e. parsing a string into a number.

Thai is included, which seems to indicate that it's a good idea to support it in custom code, too.
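As a rough illustration of what supporting Thai digits in custom code could look like, here is a minimal sketch. It relies on the fact that Windows-874 places THAI DIGIT ZERO to THAI DIGIT NINE at bytes 0xF0 to 0xF9. The function names `normalize_cp874_digits` and `parse_cp874_number` are made up for this example, and it assumes the field is already known to be CP874 and uses `.` as the decimal separator.

```cpp
#include <cstdlib>
#include <string>

// Windows-874 encodes THAI DIGIT ZERO..NINE as bytes 0xF0..0xF9.
// Returns a copy of `field` with those bytes replaced by ASCII '0'..'9';
// every other byte is left untouched.
std::string normalize_cp874_digits(const std::string& field) {
    std::string out = field;
    for (char& c : out) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b >= 0xF0 && b <= 0xF9) {
            c = static_cast<char>('0' + (b - 0xF0));
        }
    }
    return out;
}

// Hypothetical helper: normalize the digits, then hand the result to strtod.
// Assumes the default "C" locale, so '.' is the decimal separator.
double parse_cp874_number(const std::string& field) {
    return std::strtod(normalize_cp874_digits(field).c_str(), nullptr);
}
```

With this, a field containing the bytes F4 F2 2E F7 would be normalized to "42.7" before parsing, while a plain ASCII field passes through unchanged. Normalizing first keeps the actual number parsing in one place regardless of which digit set the producing program used.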

To include more information here, the linked-to page states that Microsoft's msvcr100 runtime supports decoding numerals from the following character sets:

ASCII
Arabic-Indic
Extended Arabic
Devanagari
Bengali
Gurmukhi
Gujarati
Oriya
Telugu
Kannada
Malayalam
Thai
Lao
Tibetan
Myanmar
Khmer
Mongolian
Full Width

The full page includes more programming environments and more languages (there are plenty of negatives, too).

Sorry, upon re-reading your answer, it appears you've answered the inverse routine of what I'm asking. I've updated the question to clarify. — Samuel Harmer, Jan 20 '12 at 14:44

score 1 · Accepted Answer · answered Jan 20 '12 at 14:38

These digits certainly could be created by Microsoft libraries. The properties LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS determine whether numbers formatted by the OS will use native (i.e. non-ASCII) digits. Those are initially Unicode, because that is how Windows creates strings internally. When you have a Thai locale and you convert that Unicode to CP874, those characters will be kept.

A simple function that demonstrates this behavior is `GetNumberFormatA`.

answered Jan 20 '12 at 14:38

MSalters

Would I be right in thinking that `GetNumberFormat` would be more for translation (i.e. arabic-numeral `string` to non-arabic-numeral `string`)? And it's therefore reasonable to assume no one would do this for serialization of a data type in a file which isn't meant to be hand modified? I'm essentially trying to predict what kind of number representations I am likely to encounter. – Samuel Harmer Jan 20 '12 at 14:49
It works both ways; i.e. you can use it to check how `"42.7"` would look in Thai digits. Any other Windows function that formats 42.7 into a string for human consumption (i.e. using the user's locale) will also apply LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS. – MSalters Jan 20 '12 at 15:05
And what options do you have for acknowledging/ignoring locale: none, set-once, or per-call? – Samuel Harmer Jan 20 '12 at 15:49
Depends on the function. There's always a user locale, but some functions ignore it (especially in the kernel - no user context), some use the default, and some allow you to pass it. – MSalters Jan 20 '12 at 15:59
