6

I know there is plenty of information about converting QString to char*, but I still need some clarification on this point.

Qt provides QTextCodec to convert a QString (which internally stores characters in Unicode) to a QByteArray, allowing me to retrieve a char* that represents the string in some non-Unicode encoding. But what should I do when I want to get a Unicode QByteArray?

QTextCodec* codec = QTextCodec::codecForName("UTF-8");
QString qstr = codec->toUnicode("Юникод");
std::string stdstr(reinterpret_cast<const char*>(qstr.constData()), qstr.size() * 2 );  // * 2 since unicode character is twice longer than char
qDebug() << QString(reinterpret_cast<const QChar*>(stdstr.c_str()), stdstr.size() / 2); // same

The above code prints "Юникод", as I expected. But I'd like to know whether this is the right way to get at the Unicode char* of the QString. In particular, the reinterpret_casts and size arithmetic in this technique look pretty ugly.

Oleg Andriyanov
  • @ratchetfreak you mean UTF8 and Unicode are equal? – Oleg Andriyanov Apr 03 '14 at 14:01
  • UTF8 is the byte sized unicode format, internally the QString uses UTF16, you could also grab the `data()` – ratchet freak Apr 03 '14 at 14:04
  • QString is already "юникодед". So simply call `str.toStdWString()`. `std::string` is not designed to store 16-bit characters. – Dmitry Sazonov Apr 03 '14 at 14:21
  • "you mean UTF8 and Unicode are equal" No. Your use of the word Unicode is wrong. Unicode is not an encoding, it's a standard, so talking of a "Unicode std::string" doesn't mean anything. A string by itself can't be unicode compliant. An `std::string` will have a particular "character" type (usually either 8 or 16 bits wide), and it will have a particular encoding (UCS-2 or UTF-16 for 16 bit characters, usually). The big difference between UCS-2 and UTF-16 is that UCS-2 is fixed-width: one code point per "character". In UTF-16, there may be multiple "characters" per code point. – Kuba hasn't forgotten Monica Apr 03 '14 at 14:51
  • The phrase "unicode QByteArray" is **meaningless**. It is equivalent to saying "wakalixes QByteArray". A byte array can carry text data in some 8-bit encoding, such as Latin1 (ISO/IEC 8859-1), or UTF-8, etc. If you want an 8-bit encoded byte array as a representation of a string, **you need to know what encoding is expected by the user of such an array**. Only then can you decide how to encode the string. – Kuba hasn't forgotten Monica Apr 03 '14 at 14:55
  • Please edit your question's title to indicate what encoding is desired in the `std::string`, and whether the string is 8- or 16-bits wide. – Kuba hasn't forgotten Monica Apr 03 '14 at 15:00
  • OK, presuming that it is indeed `std::string` and not `std::wstring`, the string is 8 bit wide, but the encoding question still remains. – Kuba hasn't forgotten Monica Apr 03 '14 at 15:06
  • I was wrong claiming that `QString` used UCS-2 internally, it uses UTF-16 and you can get two `QChar`s per Unicode code point. So you can actually represent all Unicode code points in a `QString`. – Kuba hasn't forgotten Monica Apr 03 '14 at 17:06
  • @KubaOber I know what you are trying to say, but still a nitpick: I wouldn't say "a byte array can carry text data in some 8-bit encoding". A byte array is also a block of memory, and can carry text (or any) data in any encoding (or format) that can be stored in computer memory. It's just easier to access the data if it is in 8 bits wide pieces, always aligned at byte boundary. – hyde Apr 05 '14 at 17:21

3 Answers

9

The following applies to Qt 5. Qt 4's behavior was different and, in practice, broken.

You need to choose:

  1. Whether you want the 8-bit wide std::string, the 16- or 32-bit wide std::wstring, or some other type.

  2. Which encoding you want in the target string.

Internally, QString stores UTF-16 encoded data, so any Unicode code point may be represented in one or two QChars.
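
To illustrate the "one or two QChars" point without depending on Qt, here is a minimal sketch in standard C++ of how a single code point maps to UTF-16 code units. The helper name `encodeUtf16` is hypothetical and not part of Qt or the standard library; it only mirrors the encoding rule that a UTF-16 string such as QString follows.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points below U+10000 take one unit; the rest take a surrogate pair.
std::u16string encodeUtf16(char32_t cp)
{
    // Precondition: a valid scalar value (in range, not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;                            // 20 significant bits
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

For example, "Ю" (U+042E) fits in one code unit, while U+1F600 needs two. This is why the size of a UTF-16 string counts code units and not necessarily characters.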

Common cases:

  • Locally encoded 8-bit std::string (as in: system locale):

    std::string(str.toLocal8Bit().constData())
    
  • UTF-8 encoded 8-bit std::string:

    str.toStdString()
    

    Unless the string contains embedded null characters (the expression below stops at the first one), this is equivalent to:

    std::string(str.toUtf8().constData())
    
  • UTF-16 or UCS-4 encoded std::wstring, 16 or 32 bits wide, respectively. Qt selects the 16- or 32-bit encoding to match the platform's width of wchar_t.

    str.toStdWString()
    
  • U16 or U32 strings of C++11 - from Qt 5.5 onwards:

    str.toStdU16String()
    str.toStdU32String()
    
  • UTF-16 encoded 16-bit std::u16string - this hack is only needed up to Qt 5.4:

    std::u16string(reinterpret_cast<const char16_t*>(str.constData()))
    

    This encoding does not include byte order marks (BOMs).

It's easy to prepend BOMs to the QString itself before converting it:

QString src = ...;
src.prepend(QChar::ByteOrderMark);
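#if QT_VERSION < QT_VERSION_CHECK(5,5,0)
// Parentheses rather than braces: list-initialization would reject the int -> size_t narrowing.
auto dst = std::u16string(reinterpret_cast<const char16_t*>(src.constData()),
                          static_cast<size_t>(src.size()));
#else
auto dst = src.toStdU16String();
#endif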

If you expect the strings to be large, you can skip one copy:

const QString src = ...;
std::u16string dst;
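dst.reserve(src.size() + 1); // BOM + text; std::u16string handles its own termination
dst.push_back(char16_t(QChar::ByteOrderMark)); // append() has no single-character overload
dst.append(reinterpret_cast<const char16_t*>(src.constData()),
           src.size()); // size() + 1 would store the terminating null as a character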

In both cases, dst is now portable to systems with either endianness.
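
As a rough illustration of why the BOM makes the data portable, the sketch below shows how a receiver could use it to recover the byte order of raw UTF-16 bytes. It is plain standard C++, and `decodeUtf16WithBom` is a hypothetical name chosen for this example, not a Qt or standard library function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode raw UTF-16 bytes that start with a BOM.
// Bytes FE FF mean big-endian, FF FE mean little-endian.
std::u16string decodeUtf16WithBom(const std::vector<unsigned char>& bytes)
{
    // Precondition: a BOM is present and the byte count is even.
    assert(bytes.size() >= 2 && bytes.size() % 2 == 0);
    const bool bigEndian = bytes[0] == 0xFE && bytes[1] == 0xFF;
    assert(bigEndian || (bytes[0] == 0xFF && bytes[1] == 0xFE));

    std::u16string out;
    for (std::size_t i = 2; i < bytes.size(); i += 2) { // skip the BOM itself
        const unsigned hi = bigEndian ? bytes[i] : bytes[i + 1];
        const unsigned lo = bigEndian ? bytes[i + 1] : bytes[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Both `FF FE 2E 04` and `FE FF 04 2E` decode to the single code unit U+042E ("Ю"), whichever byte order the sender used.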

Kuba hasn't forgotten Monica
2

Use this:

QString Widen(const std::string &stdStr)
{
    return QString::fromUtf8(stdStr.data(), stdStr.size());
}

std::string Narrow(const QString &qtStr)
{
    QByteArray utf8 = qtStr.toUtf8();
    return std::string(utf8.data(), utf8.size());
}

In all cases, the std::string should hold UTF-8.
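
One reason to pass both data() and size() instead of a bare pointer is that a length-less conversion has to stop at the first null byte. The sketch below shows the difference using only std::string; the two helper names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies up to (not including) the first '\0'.
std::string copyUntilNul(const char* p)
{
    assert(p != nullptr);
    return std::string(p);
}

// Hypothetical helper: copies exactly n bytes, embedded '\0' bytes included.
std::string copyWithSize(const char* p, std::size_t n)
{
    assert(p != nullptr);
    return std::string(p, n);
}
```

Given the three bytes `a`, `\0`, `b`, copyUntilNul keeps only "a", while copyWithSize keeps all three. With an explicit length, a string containing embedded nulls therefore survives the round trip.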

Pavel Radzivilovsky
  • Why is `stdStr.size()` necessary when calling fromUtf8? Does that result in storing the terminating null in the QString? Otherwise, it appears `fromUtf8` defaults to reading up to the terminating null... – Len Jan 23 '18 at 01:15
0

You can get the QByteArray from a UTF-16 encoded QString using this:

QTextCodec *codec = QTextCodec::codecForName("UTF-16");
QTextEncoder *encoderWithoutBom = codec->makeEncoder( QTextCodec::IgnoreHeader );
QByteArray array  = encoderWithoutBom->fromUnicode( str );
delete encoderWithoutBom; // makeEncoder() returns an object the caller must delete

This way the encoder omits the byte order mark (BOM) at the beginning of the output.

You can copy it into a char* buffer that you own (and must later release with delete[]) like this:

int dataSize=array.size();
char * data= new char[dataSize];
for(int i=0;i<dataSize;i++)
{
    data[i]=array[i];
}

Or simply:

char *data = array.data(); // valid only while array is alive and unmodified

Nejat
  • There is no such thing as a "unicode byte array" - please stop using this term, it confuses everyone. Unicode is a standard, not an encoding. There's UTF-16 and UCS-2, and the latter is what `QString` is internally encoded as. UCS-2 is a subset of UTF-16 for code points 0-0xFFFF. Since a `QString` can't carry code points outside of that range, you don't need to do anything special to get UTF-16 out of a `QString`. Just use the string's `constData()`. – Kuba hasn't forgotten Monica Apr 03 '14 at 14:49
  • @KubaOber Using constData() also gets you the BOM at the beginning, which is a mess. Using the mentioned approach, you can get the QByteArray for the string and also use different encoding options. – Nejat Apr 03 '14 at 15:40
  • Are you sure that `QString` stores the embedded BOM? – Kuba hasn't forgotten Monica Apr 03 '14 at 15:58
  • Yeah definitely. You can see http://stackoverflow.com/questions/3602548/qt-converting-qstring-to-unicode-qbytearray?rq=1 – Nejat Apr 03 '14 at 16:16
  • The first answer in your link seems to contradict you. – Kuba hasn't forgotten Monica Apr 03 '14 at 16:58
  • In fact, I've just checked, and `QString` does not carry an embedded BOM. It'd be a waste of space. This code would dump out the BOM; it doesn't: `QString str1(QStringLiteral("A")); const QChar * p = str1.constData(); while (p->unicode()) qDebug() << *p++;` – Kuba hasn't forgotten Monica Apr 03 '14 at 17:36