
Say I have a function that receives a byte array:

void fcn(byte* data)
{
...
}

Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?

Note that I'm intentionally NOT passing a length arg; all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.

This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH API function: http://support.microsoft.com/kb/138142

dlchambers
  • There is no *reliable* way, but [IsTextUnicode](http://msdn.microsoft.com/en-us/library/windows/desktop/dd318672\(v=vs.85\).aspx) is probably what's meant. – user786653 Oct 03 '11 at 21:30
  • Do you know for a fact that it contains a string of non-zero length? Do you know for a fact that reading a few bytes past the end of the string isn't fatal? I think the answer is clearly that no *reliable* way can exist. There are strings that are both valid ASCII and valid Unicode. – David Schwartz Oct 03 '11 at 21:32
  • Where is the byte array coming from? – i_am_jorf Oct 03 '11 at 21:36
  • See http://en.wikipedia.org/wiki/Bush_hid_the_facts for an amusing example of how this can go wrong. – Greg Hewgill Oct 03 '11 at 21:37
  • You'll run past the end of the string and possibly crash and burn if you don't know the length but want to check for both ASCII and UTF-16/32 content. – nos Oct 03 '11 at 21:41
  • By "ANSI", do you mean "ASCII"? There is no predefined type `byte` in C++; is it a typedef for `unsigned char`? Unicode is not a data representation; there are several ways to represent Unicode text, including UTF-8, UTF-16, UCS-2, and UTF-32. – Keith Thompson Oct 03 '11 at 21:42
  • @nos: That was my thought, but I think if `data` points to the beginning of a valid string of some kind, you can safely scan up to the first zero byte. – Keith Thompson Oct 03 '11 at 21:43
  • Keith: ASCII is a 7-bit encoding; ANSI is the historic name on Windows referring to the legacy codepage. Those two are not the same. – Joey Oct 03 '11 at 21:43
  • Why on Earth doesn't the caller tell you what kind of string it is? Surely that information exists, or existed, when the string was created. Why can't you redesign the function so the caller tells you what it is? – Keith Thompson Oct 03 '11 at 21:44
  • @Joey: I see. [Windows-1252](http://en.wikipedia.org/wiki/Windows-1252) is often *incorrectly* referred to as "ANSI", even though it was never an ANSI standard. It's a superset of [ISO 8859-1](http://en.wikipedia.org/wiki/ISO/IEC_8859-1), also known as Latin-1. – Keith Thompson Oct 03 '11 at 21:48
  • @Keith Thompson: Yes, but scanning up to the first zero byte is not very useful; you will reject most UTF-16/32 strings. – nos Oct 03 '11 at 21:48
  • And given that the OP referred to it as ANSI, a "Unicode string" probably means UTF-16. – Keith Thompson Oct 03 '11 at 21:48
  • Not sure why someone would downvote my question. I'm working inside someone else's mess and just trying to make the best of it. Keith asked "Why on Earth doesn't the caller tell you what kind of string it is?" Why indeed! Why didn't the EU make stricter rules on debt? Sometimes you're given lemons and are just trying to make lemonade. – dlchambers Nov 16 '11 at 15:36
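
As a hedged illustration of the IsTextUnicode suggestion in the comments above (likely the API the KB article alludes to): it can only make a statistical guess, and it needs a byte count, so the `size_in_bytes` parameter and the function name below are assumptions for the sake of the sketch, not something the question actually has.

```cpp
#include <windows.h>   // IsTextUnicode is declared in winnt.h; link with Advapi32.lib

// Sketch only: guesses whether a buffer looks like UTF-16 ("Unicode" in Win32
// terms). The size parameter is an assumption the question explicitly lacks.
bool probably_utf16(const void* data, int size_in_bytes)
{
    INT tests = IS_TEXT_UNICODE_UNICODE_MASK;            // which heuristics to run
    return IsTextUnicode(data, size_in_bytes, &tests) != FALSE;
}
```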

1 Answer


First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, and ASCII is a character encoding. ASCII was developed by ANSI, but the two terms are not interchangeable.

Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.

I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.

If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.

Therefore, your concern would be for the other 128 possible values. That is... complicated.

The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.

There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
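
A concrete illustration of the conflict (the array names are just for demonstration):

```cpp
// 'é' (U+00E9) is the single byte 0xE9 in Latin-1, but the two bytes 0xC3 0xA9
// in UTF-8. A lone 0xE9 looks like the lead byte of a 3-byte UTF-8 sequence,
// so a UTF-8 decoder will usually reject Latin-1 text that contains it.
const unsigned char latin1_e_acute[] = { 0xE9, 0x00 };
const unsigned char utf8_e_acute[]   = { 0xC3, 0xA9, 0x00 };
```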

There are also code pages. Many strings were encoded against a particular code page; this is particularly common on Windows. Decoding them requires knowing which code page they were encoded with.

If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes has the high bit set to 1, in which case you must use UTF-8 decoding logic.
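
A minimal sketch of that check, assuming the buffer is NUL-terminated (the function name is illustrative):

```cpp
// Returns true if every byte is <= 0x7F, i.e. the buffer is pure ASCII
// (and therefore also valid UTF-8). Assumes a NUL-terminated buffer.
bool is_plain_ascii(const unsigned char* data)
{
    for (; *data != 0; ++data) {
        if (*data > 0x7F) {
            return false;   // high bit set: needs UTF-8 decoding logic
        }
    }
    return true;
}
```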

Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
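
A rough sketch of that validation step (assumptions: NUL-terminated buffer, illustrative name; it checks sequence structure only and does not reject overlong encodings or surrogate code points, so a stricter validator may be needed):

```cpp
// Walks a NUL-terminated buffer and verifies that multi-byte sequences are
// structurally well-formed UTF-8: a valid lead byte followed by the right
// number of 10xxxxxx continuation bytes.
bool looks_like_utf8(const unsigned char* data)
{
    while (*data != 0) {
        unsigned char b = *data;
        int continuation;
        if (b <= 0x7F)               continuation = 0;  // ASCII
        else if ((b & 0xE0) == 0xC0) continuation = 1;  // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) continuation = 2;  // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) continuation = 3;  // 4-byte sequence
        else return false;           // stray continuation or invalid lead byte
        ++data;
        for (int i = 0; i < continuation; ++i, ++data) {
            if ((*data & 0xC0) != 0x80)  // must be 10xxxxxx; also catches an early NUL
                return false;
        }
    }
    return true;
}
```

If this returns false, the data is definitely not UTF-8; if it returns true, it may still be Latin-1 or code-page text that merely happens to be well-formed UTF-8, as noted above.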

Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.

Nicol Bolas
  • When it comes to byte arrays, `ANSI` commonly refers to any non-`ASCII` multi-byte encoding with character values above 127 that rely on codepages, and `Unicode` commonly refers to `UTF-16` more than `UTF-8`. – Remy Lebeau Oct 04 '11 at 06:09
  • @RemyLebeau-TeamB: Encouraging people to refer to UTF-16 as "Unicode" is a horrible practice and should never be done. This goes _double_ if you're talking about a "string" that you have a `byte*` to. – Nicol Bolas Oct 04 '11 at 06:13
  • I'm not encouraging anything, I'm merely pointing out what I have observed in practice. – Remy Lebeau Oct 04 '11 at 06:18
  • Yes, using “Unicode” to mean the UTF-16LE encoding is unfortunate and confusing, but it is standard terminology in the Microsoft world (as is the equally-misleading ‘ANSI’), so we have no choice but to deal with it. This stems from Microsoft's early implementation of Unicode, when it was believed that UCS-2 was the only way anyone would ever interact with Unicode strings. This mistake survives to this day with Windows's second-class support for UTF-8. – bobince Oct 05 '11 at 12:27
  • I have noticed the same tendency in the Linux and web world to use "Unicode" for UTF-8. That is as wrong as the Windows lingo. – Mihai Nita Oct 07 '11 at 08:37