2

I'm trying to write a codec for Code page 437. My plan was to just pass the ASCII characters through and map the remaining 128 characters in a table, using the UTF-16 value as the key.
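
A minimal sketch of that plan, in the encoding direction only (the function names are just illustrative, and the table holds a handful of example entries rather than all 128):

#include <QByteArray>
#include <QHash>
#include <QString>

// Map a few of the non-ASCII CP437 characters to their byte values;
// a full codec would list all 128 entries.
static QHash<ushort, char> makeCp437Table()
{
    QHash<ushort, char> table;
    table.insert(0x00C7, '\x80');   // Ç
    table.insert(0x00FC, '\x81');   // ü
    table.insert(0x00E9, '\x82');   // é
    // ... remaining entries omitted
    return table;
}

QByteArray encodeCp437(const QString &s)
{
    static const QHash<ushort, char> table = makeCp437Table();
    QByteArray out;
    for (int i = 0; i < s.size(); ++i) {
        ushort u = s.at(i).unicode();
        if (u < 0x80)
            out.append(char(u));            // ASCII passes straight through
        else if (table.contains(u))
            out.append(table.value(u));     // mapped via the table
        else
            out.append('?');                // no CP437 equivalent
    }
    return out;
}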

For some combined characters (letters with dots, tildes, etc.), the character appears to occupy two QChars.

A test program that prints the utf-16 values for the arguments to the program:

#include <iostream>
#include <QString>

using namespace std;

void print(QString qs)
{
    for (QString::iterator it = qs.begin(); it != qs.end(); ++it)
        cout << hex << it->unicode() << " ";
    cout << "\n";
}

int main(int argc, char *argv[])
{
    for (int i = 1; i < argc; i++)
        print(QString::fromStdString(argv[i]));
}

Some output:

$ ./utf16 Ç ü é
c3 87 
c3 bc 
c3 a9 

I had expected

c387
c3bc
c3a9

I tried the various normalization forms available in QString, but none of them gave fewer QChars than the default.

Since QChar is 2 bytes, it should be able to hold the value of each of the characters above in a single object. Why does QString use two QChars? How can I fetch the combined Unicode value?

Daniel Näslund

2 Answers

3
  1. QString::fromStdString expects an ASCII string and doesn't do any decoding. Use fromLocal8Bit instead.

  2. Your expected output is wrong. For example, Ç is U+00C7, so you should expect C7, not its UTF-8 encoding C3 87!

If you modify main() as below, you get the expected Unicode code points. For each character, the first line of output lists the local 8-bit encoding (here: UTF-8), since fromStdString is essentially a no-op and passes the bytes straight through. The second line lists the correctly decoded Unicode code point.

int main(int argc, char *argv[])
{
    for (int i = 1; i < argc; i++) {
        print(QString::fromStdString(argv[i]));
        print(QString::fromLocal8Bit(argv[i]));
    }
}

$ ./utf16 Ç ü é
c3 87
c7
c3 bc
fc
c3 a9
e9
Kuba hasn't forgotten Monica
  • Ah. 1. was caused by a misconception: I somehow got the impression (even though the docs are pretty clear) that QString would try to interpret the encoding. That's [not really doable in a consistent way](http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx). 2. I thought that QChar::unicode() returned the actual byte representation and not the code points. This solves my problem and takes me a small step further toward Unicode enlightenment. I'll wait a while before accepting, to see if anything else useful comes up (some people tend not to click on solved questions). – Daniel Näslund Jun 11 '12 at 11:12
  • 1
    What else do you want to know?? QString is internally UTF-16 based and it doesn't do any sort of encoding or decoding -- just imagine how bad it would be to have to deal with encoding every time you try to operate on the strings! – Kuba hasn't forgotten Monica Jun 11 '12 at 11:31
  • 1
    How do you expect `QChar::unicode()` to return "the actual byte representation"? There is no byte representation until you know the encoding. So, I ask, how would QChar divine the encoding you expect? QChar represents Unicode code points. Encoding is a different matter entirely, and is handled by `QTextCodec`s; see the sketch after these comments. – Kuba hasn't forgotten Monica Jun 11 '12 at 11:34
  • 1
    The link you have provided is not all that applicable to QString. Qt uses the locally selected 8-bit encoding by default (from $LANG on Unices), and that's it. If you deal with I/O (say, files) that is encoded differently, it's up to you to figure out what encoding to tell Qt to use when converting bytes to strings. – Kuba hasn't forgotten Monica Jun 11 '12 at 11:37
  • My thinking went something like: If we store the code points, then we have to do a lookup for each char when we want to print a string. That sounds expensive. And I don't recognize those code points (the original output from my test script). Perhaps I'm dealing with the raw byte representation. As for the performance, I guess an O(1) table lookup won't be too costly. Doing the actual I/O is an order of magnitude worse. – Daniel Näslund Jun 11 '12 at 11:43
  • @Kuba Ober, re the link: I was merely trying to make the point that guessing what sort of encoding we're dealing with is problematic due to the presence/absence of BOMs. And I understand that the encoding of a specific file is not determined from the current environment. How do you deal with the possibility of "the wrong encoding" for input files? – Daniel Näslund Jun 11 '12 at 11:51
  • If it's a raw text editor, the user needs to be able to give the encoding both upon opening the file and at any later point, even though it may force a reload of the file. Internally you want to deal with only one encoding, thus QString's/QChar's UTF-16 is as good a choice as any. – Kuba hasn't forgotten Monica Jun 11 '12 at 12:58
  • If you're dealing with custom file formats, you'd better know what encoding you're using. If you're dealing with XML, the encoding can be given in the stream itself: you switch encodings after parsing the encoding given in the `<?xml?>` declaration. – Kuba hasn't forgotten Monica Jun 11 '12 at 13:00
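
A minimal sketch of that QTextCodec route, for the case where the encoding of the input is known up front (the function name is just illustrative, and codec availability depends on the Qt build):

#include <QByteArray>
#include <QString>
#include <QTextCodec>

// Decode raw bytes with an explicitly chosen codec instead of guessing.
QString decodeWithCodec(const QByteArray &bytes, const char *codecName)
{
    QTextCodec *codec = QTextCodec::codecForName(codecName);  // e.g. "ISO 8859-1"
    if (!codec)
        codec = QTextCodec::codecForLocale();                 // fall back to the locale's 8-bit encoding
    return codec->toUnicode(bytes);
}
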
0

Just sidestep the problem. See QApplication in Unicode. QApplication::arguments() is already UTF-16 encoded for you, taking local conventions into account.
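
A minimal sketch of that approach, using QCoreApplication (which is enough for a console tool; arguments() is defined there and inherited by QApplication):

#include <iostream>
#include <QCoreApplication>
#include <QStringList>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    // arguments() has already converted argv from the local 8-bit
    // encoding to UTF-16, so each QChar holds a real code point.
    const QStringList args = app.arguments();
    for (int i = 1; i < args.size(); ++i) {
        const QString &s = args.at(i);
        for (int j = 0; j < s.size(); ++j)
            std::cout << std::hex << s.at(j).unicode() << " ";
        std::cout << "\n";
    }
}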

MSalters