std::string arrWords[10];
std::vector<std::string> hElemanlar;

......

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

......

What I am doing is this: every element of arrWords is a std::string. I take the n-th element of arrWords and push its characters, one at a time, into hElemanlar.

Assuming arrWords[0] is "test", then:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");

My problem is that, although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly in hElemanlar. How can I fix it?

gokturk

1 Answer


If you know that arrWords[i] contains UTF-8 encoded text, then you probably need to split the strings into complete Unicode characters.

As an aside, rather than saying:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

(which constructs a temporary std::string, obtains its C-string representation, constructs another temporary string from that, and pushes the result onto the vector), say:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

Anyway. This will need to become something like:

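std::string str(1, this->arrWords[sayKelime][j]);
if (static_cast<unsigned char>(str[0]) >= 0xC0)
{
   // str[0] is the lead byte of a multi-byte sequence: append the
   // continuation bytes (10xxxxxx, i.e. 0x80..0xBF) that follow it.
   while ((static_cast<unsigned char>(this->arrWords[sayKelime][j + 1]) & 0xC0) == 0x80)
   {
       str.push_back(this->arrWords[sayKelime][++j]);
   }
}
this->hElemanlar.push_back(str);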

Note that the loop above is safe: if j is the index of the last char in the string, [j+1] returns the nul terminator (guaranteed for std::string since C++11), which ends the loop. You will need to consider how incrementing j interacts with the rest of your code, though.
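If it is easier to do the splitting in one place instead of adjusting j inside your existing loop, the same idea can be wrapped in a standalone function. This is only a minimal sketch: splitUtf8 is a hypothetical name, not something from your code, and it assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): splits UTF-8 text into
// one std::string per code point. Assumes the input is valid UTF-8.
std::vector<std::string> splitUtf8(const std::string& text)
{
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < text.size())
    {
        // text[i] starts a new code point; step past it.
        std::size_t start = i++;
        // Skip the continuation bytes (10xxxxxx) that belong to it.
        while (i < text.size() &&
               (static_cast<unsigned char>(text[i]) & 0xC0) == 0x80)
        {
            ++i;
        }
        parts.push_back(text.substr(start, i - start));
    }
    return parts;
}
```

With something like this, filling the vector could become a single assignment such as hElemanlar = splitUtf8(arrWords[sayKelime]);, assuming you want one element per code point.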

You then need to consider whether you want hElemanlar to hold individual Unicode code points (which is what this does), or a character plus all the combining characters that follow it. In the latter case, you would have to extend the code above to:

  • Parse the next code-point
  • Decide whether it is a combining character
  • Push the UTF-8 sequence on the string if so.
  • Repeat (you can have multiple combining characters on a character).
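The steps above could be sketched roughly as follows. Everything here is illustrative: decodeAt, isCombining and splitWithCombining are hypothetical names, the input is assumed to be valid UTF-8, and isCombining only checks the Combining Diacritical Marks block (U+0300 to U+036F), which is a deliberate simplification. Real text has combining characters in other ranges too, so a Unicode library would be the more robust choice.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes the code point starting at text[i] and stores its byte length in len.
// Assumes valid UTF-8 and i < text.size().
char32_t decodeAt(const std::string& text, std::size_t i, std::size_t& len)
{
    unsigned char b = static_cast<unsigned char>(text[i]);
    char32_t cp;
    if (b < 0x80)      { cp = b;        len = 1; }
    else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
    else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
    else               { cp = b & 0x07; len = 4; }
    // Each continuation byte contributes its low six bits.
    for (std::size_t k = 1; k < len && i + k < text.size(); ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
    return cp;
}

// Simplifying assumption: only U+0300..U+036F counts as combining here.
bool isCombining(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// One element per base character together with its combining marks.
std::vector<std::string> splitWithCombining(const std::string& text)
{
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < text.size())
    {
        std::size_t start = i, len = 0;
        decodeAt(text, i, len);   // parse the base code point
        i += len;
        // Absorb every combining mark that follows the base character.
        while (i < text.size() && isCombining(decodeAt(text, i, len)))
            i += len;
        parts.push_back(text.substr(start, i - start));
    }
    return parts;
}
```

For example, "e" followed by U+0301 (combining acute accent) would end up as a single element, whereas the code-point version would split it into two.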