6

I have a problem with the transformation from QString to QByteArray and then back to QString:

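#include <QByteArray>
#include <QDebug>
#include <QString>

// areSame() and outputErrors() are my own helpers (not shown here).
int main() {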

    QString s;

    for(int i = 0; i < 65536; i++) {
        s.append(QChar(i));
    }

    QByteArray ba = s.toUtf8();

    QString s1 = QString::fromUtf8(ba);

    if(areSame(s, s1)) {
        qDebug() << "OK";
    } else {
        qDebug() << "FAIL";
        outputErrors(s, s1);
    }

    return 0;
}

As you can see, I fill a QString with every value in the 16-bit range, convert it to a QByteArray (UTF-8), and then convert that back to a QString. The problem is that the character with value 0 and the characters with values larger than 55295 fail to convert back to QString.

If I stay within the range 1 to 55295, this test passes.

sandwood
JanSLO

2 Answers

5

I had a task to convert std::string to QString, and QString to QByteArray. Following is what I did in order to complete this task.

std::string str = "hello world";

QString qstring = QString::fromStdString(str);

QByteArray buffer;

If you look up the documentation for "QByteArray::append", it has an overload that takes a QString and returns a reference to the QByteArray, so pass the QString rather than the std::string:

buffer.append(qstring);
mandroid
3

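The values from 55296 (0xD800) up to 57343 (0xDFFF) are surrogates. A high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF) together encode a single code point above U+FFFF. On its own, a surrogate has no meaning, so a string that contains each of them in isolation is not valid UTF-16.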

You can check it by running:

// QChar(0) was omitted so s and s1 start with QChar(1)
for (int i = 1 ; i < 65536 ; i++)
{
    qDebug() << i << QChar(i) << s[i-1]  << s1[i-1] << (s[i-1] == s1[i-1]);
}
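To make the pairing rule concrete, here is a minimal sketch in standard C++ (no Qt required) of how a high and a low surrogate combine into one code point. The helper names are hypothetical and exist only for this illustration; Qt provides QChar::isHighSurrogate(), QChar::isLowSurrogate() and QChar::surrogateToUcs4() for the same job.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only (not part of Qt).
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid high/low pair into the code point it encodes.
// The caller is expected to check both halves first.
constexpr char32_t combineSurrogates(char16_t high, char16_t low)
{
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}

// U+1F600 is stored in UTF-16 as the pair 0xD83D 0xDE00.
static_assert(combineSurrogates(0xD83D, 0xDE00) == 0x1F600,
              "a surrogate pair encodes one code point");
```

In the loop from the question, every surrogate is followed by the next consecutive value instead of a matching partner, so almost none of them can be combined this way. The single exception is 0xDBFF followed by 0xDC00, which happens to form a valid pair.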
lost_in_the_source
Ronald Klop
  • correct me if I'm wrong, but wouldn't the strings still be equal? – user_4685247 Jul 23 '16 at 12:36
  • When calling QString::toUtf8(), code points U+D800 to U+DFFF are replaced by 0x3F, which is '?'. That's where the information is lost. – Benjamin T Jul 23 '16 at 12:55
  • They're not "escape characters" -- the combined value of a surrogate, with the one after it, encodes a code point. If you have a disjoined sequence of surrogates then your encoding is broken, and Qt is allowed to do anything with it. Including replacing stray surrogates with a `'?'`. – peppe Jul 23 '16 at 18:04
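The replacement described in the comments can be sketched in standard C++ as follows. This is not Qt's implementation: utf16ToUtf8Lossy is a hypothetical function written for this illustration, and it assumes the behaviour reported above, namely that every unpaired surrogate becomes '?'. What a given Qt version actually emits may differ.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoder for illustration only; this is not Qt's code.
// Converts UTF-16 code units to UTF-8 and writes '?' for every surrogate
// that is not part of a valid high/low pair.
std::string utf16ToUtf8Lossy(const std::u16string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: combine both units into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            // Unpaired surrogate: the original value is lost here.
            out += '?';
            continue;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch, a lone 0xD800 and a lone 0xDFFF both encode to the same single byte '?', so decoding the result cannot restore either value. That is why the round trip in the question compares unequal.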