1

The ISO-8859-5 character set covers a small subset of the Unicode character set. I want to test, in C++, whether a Unicode character can be represented in ISO-8859-5. To do this I want to write a function like isLegal below, so that the following code filters out non-ISO-8859-5 characters.

Assume that the wstring came from a Unicode-encoded string.

wstring str = L"AåБ😀🙄😡0";
vector<char32_t> bytes(str.begin(), str.end());
for (vector<char32_t>::const_iterator i = bytes.begin(); i != bytes.end(); ++i) {
    if (isLegal(*i, "ISO-8859-5"))
    {
        std::cout << (*i) << ' ';
    }
}

The reason for this is that I would like to limit the supported characters to a subset of the Unicode superset, so that the user can't submit characters such as emojis or characters outside the supported languages. Thank you for your help.

Is there a simple way to do this, for instance using codecs or something along those lines? I know of this function from Qt; is there anything in that vein that could help me?

QTextCodec *codec = QTextCodec::codecForName("ISO 8859-5");

Or perhaps there is a library out there that would do this for me.

Note: Why am I using wstring? My understanding is that Unicode characters, encoded as UTF-8, use between 1 and 4 bytes per character. This is the binary representation of the character, which is different from how the character is rendered. std::string supports a multibyte string, but when I tried to isolate individual characters I couldn't tell where a character started and where it ended, because the number of bytes per character is inconsistent.

So I used a codec to decode the multibyte string into a std::wstring, which is templated on wchar_t. wchar_t on Linux is 4 bytes wide, so each character has a consistent width. Because of this, if you put a multibyte Unicode string into a wstring you can more easily identify each character: every character is a consistent 4 bytes wide, and all Unicode code points fit into a 4-byte width, so the wstring can hold any possible Unicode character.
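For reference, here is a minimal sketch of that decoding step, assuming the input is a UTF-8 encoded std::string. The function name is illustrative; std::wstring_convert with std::codecvt_utf8 is the C++11 facility for this conversion (later deprecated in C++17, but what was available at the time):

#include <codecvt>
#include <locale>
#include <string>

// Decode a UTF-8 encoded std::string into one char32_t per Unicode code point,
// so individual characters can be inspected without tracking multi-byte sequences.
std::u32string decodeUtf8(const std::string &utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8); // throws std::range_error on invalid UTF-8
}

Each element of the result is a whole code point, which is exactly what a function like isLegal is meant to receive.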

N D
  • I'm voting to close this question as off-topic because it's just a "give me the code". – Sebastian Redl Apr 27 '16 at 18:13
  • Okay, apparently I don't understand the culture here. I searched for an answer to this for a few days on Stack Overflow. I have never asked a question here after many years of using the site and answering questions where I can. I am in the process of coming up with an answer and was going to post it once I'm done. Please advise me on how I could rewrite the question. – N D Apr 27 '16 at 22:06
  • If you were going to provide the code yourself, you should really have waited until you were ready to post both the question and answer, and submitted them at the same time. We all thought you wanted us to just do it for you. I'm voting to reopen; once several others have done the same, you will be able to post your answer. – Lightness Races in Orbit Apr 27 '16 at 22:15
  • Well, I'm not very satisfied with my solution, so I was hoping to get some advice mid-course. I'm forced to manually compare the character sets. I was hoping there was a function somewhere that I could have used, like using a codec. – N D Apr 27 '16 at 22:18
  • For a one-off, your 'manually comparing' code may very well be more efficient than a generalized codec routine. For instance: since you only have one single target encoding to test, you don't have to loop over each of its characters. You can sort the Cyrillic encoding on its (translated!) Unicode values and then efficiently use a binary lookup for each of your input characters. (Voted to reopen, by the way.) – Jongware Apr 27 '16 at 22:24 (see the sketch after these comments)
  • Thank you, I appreciate it. I want to support codecs ISO 8859-1 to ISO 8859-5; I just rewrote the question in case it was too broad. I'm afraid that in manually encoding the list I might miss a character here or there. – N D Apr 27 '16 at 22:28
  • If you're using UTF-8, why are your strings using wide characters? Indeed `vector<char32_t> bytes(str.begin(), str.end());` makes no sense, as wide characters need not be Unicode at all. And on Windows, they're UTF-16, which does not directly map to UTF-32 the way you do here. – Nicol Bolas Apr 27 '16 at 22:39
  • @ND: "*I want to support codecs ISO 8859-1 to ISO 8859-5*" Those are very different things, and you'd need a separate function to check for each one. Indeed, writing a checker for Latin-1 is trivial, while writing a checker for Latin/Cyrillic is far more difficult. – Nicol Bolas Apr 27 '16 at 22:43
  • @NicolBolas: not really "far more difficult", is it? Latin-1 indeed is trivial, but for other character sets all it needs is 1 (one) lookup routine, with a pointer to an array of 256 Unicode values (or 128, if you are really stingy, but it hardly matters for a binary search). ".. in manually encoding the list I might miss a character.." – just copy and paste them from web pages and sort, or look up presorted lists. – Jongware Apr 27 '16 at 22:50
  • @NicolBolas so you are saying that char32_t is too wide? should I use something else? – N D Apr 27 '16 at 23:22
  • @RadLexus I am using the tables here: http://czyborra.com/charsets/iso8859.html (these tables don't include the control characters for the old teletype machines, which suits me fine). But I still have questions as to their completeness. Is there a better place that is known to have more reliable tables that I should be looking at? – N D Apr 27 '16 at 23:26
  • For all things Unicode™, have a look at `unicode.org`! Here is a list of all ISO 8859 encodings: ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859 – Jongware Apr 27 '16 at 23:59
  • @ND: "*so you are saying that char32_t is too wide?*" I'm saying that it's not UTF-8. Your question is supposedly about how to test if a UTF-8 character sequence contains codepoints outside a certain range. Your code doesn't actually use UTF-8 *anywhere*. So what exactly is your question? It's like someone asking about OpenGL, but then posting a bunch of D3D code. – Nicol Bolas Apr 28 '16 at 00:07
  • I am assuming that wstring was derived from UTF. But I could clean that up I suppose. – N D Apr 28 '16 at 00:39
  • @ND You seem to be confused between code units and code points, and seem to ignore the fact that UTF-8 is a particular encoding used to express (encode) Unicode code points using bytes. – Kuba hasn't forgotten Monica Apr 28 '16 at 17:30
  • Thank you. I don't know the difference between code units and code points. My understanding is that UTF-8 characters use between 1 and 4 bytes per character. This is the binary representation of the character, which is different from when the character is rendered. The std::wstring is templated on wchar_t. wchar_t on Linux is 4 bytes wide, thus if you put a multibyte UTF-8 set into a wstring you can more easily identify each character, since each character is a consistent width of 4 bytes and all UTF-8 characters will fit into a 4-byte width, so the wstring handles any possible characters from UTF-8. – N D Apr 28 '16 at 17:51
  • @ND: That's indeed a correct summary. The reason we're surprised is that programming indeed often is a matter of breaking a hard task into manageable chunks. The UTF-8 task in your title would be hard, but converting it to UTF-32 and then determining if it's from a subset is a lot easier. So you initially gave the impression of needing a lot more programming help, when it really was just a poor choice of words. – MSalters Apr 28 '16 at 18:19
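To make Jongware's suggestion concrete, here is a hedged sketch (all names are illustrative) built around the 256-entry byte-to-code-point tables in the unicode.org mapping files linked above: sort the table once per encoding, then answer each membership question with a binary search.

#include <algorithm>
#include <vector>

// Build, once per encoding, a sorted list of the code points it can represent,
// starting from a 256-entry table in which index b holds the Unicode code point
// assigned to byte b (the layout of the unicode.org mapping files).
std::vector<char32_t> sortedRepertoire(const char32_t (&table)[256])
{
    std::vector<char32_t> repertoire(table, table + 256);
    std::sort(repertoire.begin(), repertoire.end());
    return repertoire;
}

// One binary search per input code point.
bool isRepresentable(char32_t codePoint, const std::vector<char32_t> &repertoire)
{
    return std::binary_search(repertoire.begin(), repertoire.end(), codePoint);
}

Because the mapping files are machine-readable, this also avoids the worry about missing a character while copying lists by hand.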

2 Answers

0

There is no Standard C++ library for character code conversions. In fact, I don't think a C++ implementation even needs to be aware of more than one encoding. So any solution is going to require a library, or else hand-crafted code (i.e. a big switch...).

Since you mention Qt: yes, you should be able to use QTextCodec::canEncode:

#include <QDebug>
#include <QTextCodec>

#include <string>

int main() {
    std::wstring const str = L"AåБ😀🙄😡0";
    auto const *codec = QTextCodec::codecForName("ISO-8859-5");
    if (!codec) {
        qFatal("Codec not found");
    }

    qDebug() << "Using codec" << qPrintable(codec->name());

    for (auto c: str) {
        if (codec->canEncode(c))
            qDebug() << c;
    }
}

But this gives me

Using codec ISO-8859-5
65
229
1041
128512
128580
128545
48

So that's a non-solution.

Toby Speight
  • Although `canEncode` is indeed broken, you're passing large `c`'s that require a surrogate pair to represent. You need to be passing them as surrogate pairs encoded in a `QString` instead. You need to use `QChar` surrogate logic to check if a given UCS-4 `c` is representable as a single `QChar` or as a surrogate pair, and go from there. – Kuba hasn't forgotten Monica Apr 28 '16 at 17:42 (see the sketch after these comments)
  • @Kuba - I hadn't spotted that those were outside BMP. Thanks for the clarification. – Toby Speight Apr 28 '16 at 17:44
  • One thing I'm unsure of is whether L"foo" expects "foo" to be UTF-8 or what :( It's implementation defined, it seems, per [this answer](http://stackoverflow.com/a/25568251/1329652). – Kuba hasn't forgotten Monica Apr 28 '16 at 17:53
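A hedged sketch of that correction, as a drop-in replacement for the loop in the code above (untested, and it assumes a 32-bit wchar_t as on Linux, so that every element of str is a full code point). QString::fromUcs4 produces a surrogate pair for code points above U+FFFF, so the codec is asked about the character as Qt actually represents it:

for (wchar_t wc : str) {
    const uint ucs4 = static_cast<uint>(wc);
    // One QChar for BMP characters, a surrogate pair for anything above U+FFFF.
    const QString character = QString::fromUcs4(&ucs4, 1);
    if (codec->canEncode(character))
        qDebug() << character;
}

Whether canEncode then rejects the emoji for this codec is a separate question; if it still does not, the table-lookup approach from the question's comments avoids relying on it altogether.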
0

For the moment I am using this custom solution:

#include <algorithm>
#include <string>
#include <vector>
#include <boost/assign/std/vector.hpp>

using namespace std;
using namespace boost::assign;

bool isIntInSet(int val, const std::vector<int> &set){
  return std::find(set.begin(), set.end(), val) != set.end();
}

bool isLegal(int val, const string &isoNum){
  const string ISO8859_5 = "ISO-8859-5";
  if (ISO8859_5 == isoNum){
    vector<int> isoSet5;
    // Printable ASCII, 0x20 - 0x7E (shared with all ISO 8859 parts)
    isoSet5 += 0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0027,0x0028,0x0029,0x002A,0x002B,0x002C,0x002D,0x002E,0x002F,
               0x0030,0x0031,0x0032,0x0033,0x0034,0x0035,0x0036,0x0037,0x0038,0x0039,0x003A,0x003B,0x003C,0x003D,0x003E,0x003F,
               0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,0x0047,0x0048,0x0049,0x004A,0x004B,0x004C,0x004D,0x004E,0x004F,
               0x0050,0x0051,0x0052,0x0053,0x0054,0x0055,0x0056,0x0057,0x0058,0x0059,0x005A,0x005B,0x005C,0x005D,0x005E,0x005F,
               0x0060,0x0061,0x0062,0x0063,0x0064,0x0065,0x0066,0x0067,0x0068,0x0069,0x006A,0x006B,0x006C,0x006D,0x006E,0x006F,
               0x0070,0x0071,0x0072,0x0073,0x0074,0x0075,0x0076,0x0077,0x0078,0x0079,0x007A,0x007B,0x007C,0x007D,0x007E,
               // No-break space, soft hyphen, and the Cyrillic letters Ё..Џ (Ѝ is not in ISO-8859-5)
               0x00A0,0x0401,0x0402,0x0403,0x0404,0x0405,0x0406,0x0407,0x0408,0x0409,0x040A,0x040B,0x040C,0x00AD,0x040E,0x040F,
               // Basic Cyrillic letters А..я, 0x0410 - 0x044F
               0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,0x0418,0x0419,0x041A,0x041B,0x041C,0x041D,0x041E,0x041F,
               0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,0x0428,0x0429,0x042A,0x042B,0x042C,0x042D,0x042E,0x042F,
               0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,0x0438,0x0439,0x043A,0x043B,0x043C,0x043D,0x043E,0x043F,
               0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,0x0448,0x0449,0x044A,0x044B,0x044C,0x044D,0x044E,0x044F,
               // Numero sign, section sign, and the Cyrillic letters ё..џ (ѝ is not in ISO-8859-5)
               0x2116,0x0451,0x0452,0x0453,0x0454,0x0455,0x0456,0x0457,0x0458,0x0459,0x045A,0x045B,0x045C,0x00A7,0x045E,0x045F;
    if (isIntInSet(val, isoSet5)) return true;
  }
  return false;
}

Looking at the lists of visible characters at http://czyborra.com/charsets/iso8859.html, the tables there do not include the control characters, so this is not the complete ISO-8859-5 list of characters, but it seems good enough for all printable characters.

N D
  • All ISO 8859 sets are extensions of ASCII (control characters included), and so is UTF-8 (as is Unicode itself when you ignore leading zeroes). So you might want to split that test in two: a quick check for `<128` and a lookup for the rest. – MSalters Apr 28 '16 at 18:22 (see the sketch below)
  • Nice idea. That would make it more efficient. – N D May 02 '16 at 18:21
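To illustrate that split, here is a hedged sketch (the function name is mine) that accepts exactly the same code points as the isoSet5 table above, but with a quick range check for the printable ASCII part and, since the non-ASCII part of ISO-8859-5 is nearly contiguous, a few more range checks instead of a list search for the rest:

// Same repertoire as the isoSet5 table above (control characters still excluded).
bool isLegalIso8859_5(int val)
{
    if (val >= 0x20 && val < 0x7F)          // printable ASCII, shared by all ISO 8859 parts
        return true;
    if (val == 0x00A0 || val == 0x00A7 || val == 0x00AD)   // NBSP, section sign, soft hyphen
        return true;
    if (val >= 0x0401 && val <= 0x045F      // the Cyrillic range used by ISO-8859-5 ...
        && val != 0x040D && val != 0x0450 && val != 0x045D) // ... minus the three code points it omits
        return true;
    return val == 0x2116;                   // numero sign
}

A sorted vector plus std::binary_search, as sketched under the question's comments, would work just as well if you would rather generate the list from a mapping file than hand-check the ranges.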