
I want to convert QStrings into filenames. Since I'd like the filename to look clean, I want to replace every character that is not a letter or a number with an underscore. The following code should do that.

#include <iostream>
#include <QString>

QString makeFilename(const QString& title)
{
    QString result;
    for(QString::const_iterator itr = title.begin(); itr != title.end(); itr++)
     result.push_back(itr->isLetterOrNumber()?itr->toLower():'_');
    return result;
}

int main()
{
    QString str = "§";
    std::cout << makeFilename(str).toAscii().data() << std::endl;
}

However, on my computer this does not work. I get this output:

�_

Looking for an explanation, I found while debugging that QString("§").size() = 2 > 1 = QString("a").size().

My questions:

  • Why does QString use 2 QChars for "§"? (solved)
  • Do you have a solution for makeFilename? Would it also work for Chinese characters?
Daniel Hedberg
Johannes

2 Answers


Ok, here's my theory: when you feed the "§" literal to a QString, Qt falls back to a default encoding (Latin-1 in Qt 4) because you didn't set one. If your compiler uses UTF-8 to store string literals, you are feeding it 2 bytes, which are converted into 2 characters instead of one. Likewise, your "toAscii" output most likely does the wrong thing too.

From the looks of it, you'll have to find out what your compiler uses to store string literals, and call setCodecForCStrings with the correct value.

EDIT: given your description, if I didn't know the encoding for my compiler, I would probably try QTextCodec::codecForName("UTF-8") as parameter to the setCodec first :-)
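
To make the theory concrete, here is a minimal sketch in plain C++ (no Qt) of what happens to the two UTF-8 bytes of "§" (0xC2 0xA7) under each interpretation. The helpers `decodeLatin1` and `decodeUtf8` are hypothetical names invented for this illustration, and the UTF-8 decoder assumes well-formed input. In Qt terms, the difference roughly corresponds to `QString::fromLatin1` versus `QString::fromUtf8`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: treat every byte as one Latin-1 character,
// which is what a byte-per-character default conversion does.
std::u16string decodeLatin1(const std::string& bytes)
{
    std::u16string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char16_t>(b));
    return out;
}

// Hypothetical helper: minimal UTF-8 to UTF-16 decoder.
// Assumption: the input is well-formed UTF-8 (no validation is done).
std::u16string decodeUtf8(const std::string& bytes)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Sequence length is derived from the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        assert(i + len <= bytes.size());
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

With the input `"\xC2\xA7"`, the Latin-1 path yields two units (U+00C2 "Â", which counts as a letter, and U+00A7 "§"), while the UTF-8 path yields the single unit U+00A7. That would be consistent with the two-character output in the question.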

Christian Stieber
  • I prepended the line `QTextCodec::setCodecForCStrings ( QTextCodec::codecForName("UTF-8") );` and changed nothing else. However, it made no difference to my `makeFilename` function. It also failed with the other answer to this question. :( – Johannes Oct 04 '12 at 06:35

In addition to what others have said, keep in mind that a QString is a UTF-16 encoded string. A Unicode character that is outside of the BMP requires 2 QChar values working together, called a surrogate pair, in order to encode that character. The QString documentation says as much:

Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.
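
The rule quoted from the documentation can be illustrated with a small plain C++ sketch (no Qt). The function names `toUtf16`, `isHighSurrogate`, and `isLowSurrogate` are chosen for this illustration only; they mirror the idea behind the `QChar` methods but are not Qt's implementation.

```cpp
#include <cassert>
#include <string>

// A high (leading) surrogate lies in 0xD800..0xDBFF.
bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// A low (trailing) surrogate lies in 0xDC00..0xDFFF.
bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Encode one Unicode code point as UTF-16 code units.
// Precondition: cp is a valid scalar value (not itself a surrogate).
std::u16string toUtf16(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // BMP code points fit in a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Everything above 0xFFFF is split into a surrogate pair.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

Under this encoding, U+00A7 ("§") occupies one unit, whereas a code point such as U+1F600 occupies two (0xD83D followed by 0xDE00).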

You are not taking that into account when looping through the QString. You are looking at each QChar individually without checking if it belongs to a surrogate pair or not.

Try this instead:

QString makeFilename(const QString& title) 
{ 
    QString result; 

    QString::const_iterator itr = title.begin();
    QString::const_iterator end = title.end();

    while (itr != end)
    {
        if (!itr->isHighSurrogate())
        {
            if (itr->isLetterOrNumber())
            {
                result.push_back(itr->toLower()); 
                ++itr;
                continue;
            }
        }
        else
        {
            ++itr;
            if (itr == end)
                break; // error - missing low surrogate

            if (!itr->isLowSurrogate())
                break; // error - not a low surrogate

            /*
            letters/numbers should not need to be surrogated,
            but if you want to check for that then you can use
            QChar::surrogateToUcs4() and QChar::category() to
            check if the surrogate pair represents a Unicode
            letter/number codepoint...

            uint ch = QChar::surrogateToUcs4(*(itr-1), *itr);
            QChar::Category cat = QChar::category(ch);
            if (
                ((cat >= QChar::Number_DecimalDigit) && (cat <= QChar::Number_Other)) ||
                ((cat >= QChar::Letter_Uppercase) && (cat <= QChar::Letter_Other))
                )
            {
                // a single QChar cannot hold a code point above 0xFFFF
                uint lower = QChar::toLower(ch);
                result.append(QString::fromUcs4(&lower, 1));
                ++itr;
                continue;
            }
            */
        }

        result.push_back('_');
        ++itr; 
    }

    return result; 
} 
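
For comparison, the same idea of walking the string by code point rather than by code unit can be sketched without Qt on a `std::u16string`. This is only an illustration under stated assumptions: `sanitize`, `KeepFn`, and `isAsciiAlnum` are hypothetical names, the predicate is an ASCII-only stand-in for a real Unicode letter/number test such as `QChar::isLetterOrNumber()`, and no lowercasing is done. Unlike the loop above, it replaces an unpaired surrogate with an underscore instead of stopping.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical predicate type: decides whether a code point may stay.
using KeepFn = bool (*)(char32_t);

// ASCII-only stand-in for a real Unicode letter/number test (an assumption).
bool isAsciiAlnum(char32_t cp)
{
    return (cp >= U'0' && cp <= U'9') ||
           (cp >= U'a' && cp <= U'z') ||
           (cp >= U'A' && cp <= U'Z');
}

// Replace every code point rejected by `keep` with a single underscore.
std::u16string sanitize(const std::u16string& title, KeepFn keep)
{
    std::u16string result;
    std::size_t i = 0;
    while (i < title.size()) {
        char16_t unit = title[i];
        char32_t cp = unit;
        std::size_t len = 1;
        bool high = unit >= 0xD800 && unit <= 0xDBFF;
        if (high && i + 1 < title.size() &&
            title[i + 1] >= 0xDC00 && title[i + 1] <= 0xDFFF) {
            // Combine a valid surrogate pair into one code point.
            cp = 0x10000 + ((char32_t(unit) - 0xD800) << 10)
                         + (char32_t(title[i + 1]) - 0xDC00);
            len = 2;
        }
        // A lone surrogate (high or low) is never kept.
        bool unpaired = len == 1 && unit >= 0xD800 && unit <= 0xDFFF;
        if (!unpaired && keep(cp))
            result.append(title, i, len);   // copy the 1 or 2 units unchanged
        else
            result.push_back(u'_');         // one underscore per code point
        i += len;
    }
    return result;
}
```

Passing the classification in as a function keeps the iteration logic separate from the question of what counts as a letter, which is where a real Unicode-aware test would plug in.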
Remy Lebeau
  • Thanks for giving full code. I tried it, but it did not work (I left the commented part out, but that shouldn't be the reason): § still outputs as above. Do you have an idea why? Do I need to call `setCodecForCStrings` before applying your code? – Johannes Oct 04 '12 at 06:29
  • I found out by debugging that `if (!itr->isHighSurrogate())` is being entered twice, so both seem to be low surrogates. oO – Johannes Oct 04 '12 at 06:43
  • Then the `QString` is malformed, because `§` does not use a surrogate pair in UTF-16, so you should not have 2 `QChar` elements in the string to begin with. – Remy Lebeau Oct 04 '12 at 07:34
  • I agree with @ChristianStieber. Your input is not being parsed correctly so the wrong `QString` is produced. You cannot assume which encoding a compiler uses to store a `char*` literal, so you need to use the compiler's actual encoding, or else encode the character manually in an encoding of your own choosing and then use that encoding when producing the `QString`. – Remy Lebeau Oct 04 '12 at 07:41
  • Or better, use `L"§"` with `QString::fromStdWString()`. – Remy Lebeau Oct 04 '12 at 07:49
  • `QString str = QString::fromStdWString(L"§");` – Remy Lebeau Oct 04 '12 at 07:56