The ISO-8859-5 character set covers a subset of the Unicode repertoire. I want to test, in C++, whether a Unicode character belongs to ISO-8859-5. To do this I want to write a function like isLegal below, so that the following code filters out non-ISO-8859-5 characters.
Assume that the wstring came from a Unicode-encoded string.
std::wstring str = L"AåБ0";
// Each element of the vector is one whole code point (wchar_t is 4 bytes on Linux).
std::vector<char32_t> codePoints(str.begin(), str.end());
for (std::vector<char32_t>::const_iterator i = codePoints.begin(); i != codePoints.end(); ++i)
{
    if (isLegal(*i, "ISO-8859-5"))
    {
        // Print the numeric code point; char32_t has no dedicated stream overload.
        std::cout << static_cast<unsigned int>(*i) << ' ';
    }
}
The reason for this is that I would like to limit the accepted characters to a subset of Unicode, so that users can't submit characters such as emoji or characters outside the supported languages. Thank you for your help.
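For ISO-8859-5 specifically I could hard-code the check myself. Here is a rough sketch of what I mean (the ranges come from my reading of the ISO-8859-5 code chart, so they would need verifying):

// Hand-rolled check for ISO-8859-5 only; ranges should be checked against the official mapping.
bool isLegalIso88595(char32_t cp)
{
    if (cp <= 0x00A0)                         // controls, ASCII, NBSP
        return true;
    if (cp == 0x00A7 || cp == 0x00AD)         // section sign, soft hyphen
        return true;
    if (cp == 0x2116)                         // numero sign
        return true;
    // The Cyrillic range covered by ISO-8859-5, minus the code points it skips.
    return cp >= 0x0401 && cp <= 0x045F
        && cp != 0x040D && cp != 0x0450 && cp != 0x045D;
}

But I'd rather not maintain a table like this by hand for every character set I might need.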
Is there a simple way to do this, for instance using codecs or something similar? I know about the following from Qt; is there anything along those lines that could help me?
QTextCodec *codec = QTextCodec::codecForName("ISO 8859-5");
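If I go the Qt route, I imagine isLegal would look roughly like this. This is only a sketch: I'm assuming QTextCodec::canEncode reports whether the codec can represent the given string, and I'm using the Qt 5 signature of QString::fromUcs4 (in Qt 6 QTextCodec lives in the Qt5Compat module).

#include <QTextCodec>
#include <QString>

bool isLegal(char32_t codePoint, const char *codecName)
{
    QTextCodec *codec = QTextCodec::codecForName(codecName);
    if (!codec)
        return false;                         // codec name not recognised
    uint ucs4 = codePoint;                    // QString::fromUcs4 takes uint in Qt 5
    QString s = QString::fromUcs4(&ucs4, 1);  // handles non-BMP characters too
    return codec->canEncode(s);
}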
Or perhaps there is a library out there that would do this for me.
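On Linux, I suppose I could also lean on iconv(3), which already knows the ISO-8859-5 mapping. A rough sketch of the idea, assuming iconv reports failure (returns (size_t)-1) when a character can't be represented in the target charset:

#include <iconv.h>
#include <cstring>

bool isLegal(char32_t codePoint, const char *charset)
{
    // Convert from UTF-32LE (assuming a little-endian host) to the target charset.
    iconv_t cd = iconv_open(charset, "UTF-32LE");
    if (cd == (iconv_t)-1)
        return false;                         // charset name not recognised

    char in[sizeof codePoint];
    std::memcpy(in, &codePoint, sizeof codePoint);
    char out[8];
    char *inp = in, *outp = out;
    size_t inLeft = sizeof in, outLeft = sizeof out;

    size_t rc = iconv(cd, &inp, &inLeft, &outp, &outLeft);
    iconv_close(cd);
    return rc != (size_t)-1;                  // conversion succeeded => representable
}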
Note: Why am I using wstring? My understanding is that UTF-8 encodes each Unicode character using between 1 and 4 bytes. This is the binary representation of the character, which is different from how the character is rendered. A std::string can hold such a multibyte string, but when I tried to isolate individual characters I couldn't tell where one character started and the next ended, because the number of bytes per character is not constant.
So I used a codec to decode the multibyte string into a std::wstring, which is built on wchar_t. On Linux, wchar_t is 4 bytes wide, so every character has a consistent width. Because of this, once the multibyte Unicode string has been decoded into a wstring, it is easy to identify each character: every element is 4 bytes, and every Unicode code point fits in 4 bytes, so the wstring can hold any possible Unicode character.
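For completeness, this is roughly how I'm doing that decoding step at the moment (fromUtf8 is just my own helper name; std::wstring_convert and the codecvt header are deprecated since C++17, but they still work for me on Linux with its 4-byte wchar_t):

#include <codecvt>
#include <locale>
#include <string>

std::wstring fromUtf8(const std::string &utf8)
{
    // codecvt_utf8<wchar_t> converts UTF-8 <-> UCS-4 when wchar_t is 4 bytes wide.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}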