-3

I have a file that contains Unicode text in an unstated encoding. I want to scan through this file looking for any Arabic code points in the range U+0600 through U+06FF, and map each applicable Unicode code point to a byte of ASCII, so that the newly produced file is composed purely of ASCII characters, with all code points under 128.

How do I go about doing this? I tried to read the file the same way I would read ASCII, but my terminal shows ?? because the characters are multi-byte.

NOTE: the file is made up of a subset of the Unicode character set, and the subset is smaller than the ASCII character set. Therefore I am able to do a 1:1 mapping from this particular Unicode subset to ASCII.

tchrist
Mike G
  • I tried to read them the same way I read ASCII, but I get ?? since they are multi-byte characters – Mike G Feb 17 '12 at 18:38
  • 1
    What is the encoding of the Unicode? UTF7? UTF8? UTF16BE? UTF16LE? UTF32BE? UTF32LE? UCS2? Something else? (All of those can be described as "multibyte" Unicode) – Mooing Duck Feb 17 '12 at 18:38
  • 3
    "I have a file that contains text in Unicode." Files do not contain text in "Unicode". They contain text that is stored in one of the many Unicode *encodings*. Without knowing what encoding the file is using, you cannot know how to process it. – Nicol Bolas Feb 17 '12 at 18:40
  • 1
    [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) – Mooing Duck Feb 17 '12 at 18:41
  • The file contains Arabic characters, which I believe are in the range U+0600–U+06FF (225 characters) – Mike G Feb 17 '12 at 18:41
  • 2
    @MikeG: Note that ASCII cannot hold any Arabic characters, so this task is impossible. Also, you still never told us the encoding. – Mooing Duck Feb 17 '12 at 18:42
  • U+0600 - U+06FF is more than 225 characters. But how do you want to map them, to 0x00 - 0xFF? – Mr Lister Feb 17 '12 at 18:43
  • 1
    @MooingDuck I know that; what I want to do is map an Arabic character to an English (ASCII) character. How do I find out the encoding? What I found from my search is that it is Unicode 6.1 – Mike G Feb 17 '12 at 18:44
  • @MrLister I just want to map a subset of the 225, not all 225 characters – Mike G Feb 17 '12 at 18:45
  • @MooingDuck he just means mapping to single char bytes, I think that's clear. What's not clear is why he thinks the file actually contains Arabic codepoints, since he gets ??. ?? usually means the file contains "??". – Mr Lister Feb 17 '12 at 18:46
  • If you show us the first 16 or so byte values, we could probably guess. – Mooing Duck Feb 17 '12 at 18:46
  • You still haven't told us what encoding the file is in or HOW you want to map the Arabic codepoints to bytes. You can't map 225 codepoints to "English", because "English" is only 52 codepoints. – Mr Lister Feb 17 '12 at 18:47
  • @MrLister: It's required that a conversion between Unicode formats translates unconvertible characters to a `?` or similar character, so it's very common. – Mooing Duck Feb 17 '12 at 18:47
  • @MooingDuck I am sure it's Arabic code points because I created the file in Microsoft Word. But when I try to read and print the file in the console I see ??? – Mike G Feb 17 '12 at 18:48
  • 2
    This is not a job for C⁺⁺. It is a job for a simple domain-specific tool like *iconv*. At most, it is a job for Perl, which has the best Unicode support you’ll find. But it is definitely overkill to have to deal with such a trivial issue in C⁺⁺. – tchrist Feb 17 '12 at 18:51
  • @MikeG I'm beginning to get the feeling that you want to transliterate the Arabic, not map some codepoints to some other codepoints; can that be right? – Mr Lister Feb 17 '12 at 18:53
  • @tchrist Well, in my defense, he did mention ASCII to begin with. – Mr Lister Feb 17 '12 at 18:54
  • @MrLister you are right! – Mike G Feb 17 '12 at 18:54
  • That's why I upvoted your answer, silly! – Mr Lister Feb 17 '12 at 18:58
  • @MikeG I updated your question title, text, and tags to make it easier for people to understand and search for. If I got any of it wrong, please forgive me and feel free to change it to whatever you think best. – tchrist Feb 17 '12 at 19:54

3 Answers

4

This is either impossible, or it’s trivial. Here are the trivial approaches:

  • If no code point exceeds 127, then simply write it out in ASCII. Done.

  • If some code points exceed 127, then you must choose how to represent them in ASCII. A common strategy is to use XML syntax, as in `&#x3B1;` for U+03B1. This will take up to 10 ASCII characters for each trans-ASCII Unicode code point transcribed; a minimal sketch follows this list.
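For illustration, here is a minimal sketch of that escaping strategy in C++ (the helper name escape_to_ascii is mine, and it assumes the input has already been decoded into raw code points):

#include <cstdio>
#include <string>

// Escape anything outside ASCII as an XML numeric character reference.
std::string escape_to_ascii(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 128) {
            out += (char)cp;   // plain ASCII passes through unchanged
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%X;", (unsigned)cp);  // U+03B1 -> "&#x3B1;"
            out += buf;
        }
    }
    return out;
}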

The impossible ones I leave as an exercise for the original poster. I won’t even mention the foolish-but-possible (read: stupid) approaches, as these are legion. Data destruction is a capital crime in data processing, and should be treated as such.

Note that I am assuming by ‘Unicode character’ you actually mean ‘Unicode code point’; that is, a programmer-visible character. For user-visible characters, you need ‘Unicode grapheme (cluster)’ instead.

Also, unless you normalize your text first, you’ll hate the world. I suggest NFD.


EDIT

After further clarification by the original poster, it seems that what he wants to do is very easily accomplished using existing tools without writing a new program. For example, this converts a certain set of Arabic characters from a UTF-8 input file into an ASCII output file:

$ perl -CSAD -Mutf8 -pe 'tr[ابتثجحخد][abttjhhd]' < input.utf8 > output.ascii

That only handles these code points (the -CSAD switches tell perl that STDIN/STDOUT/STDERR, @ARGV, and the default I/O layers are all UTF-8):

U+0627 ا  ARABIC LETTER ALEF
U+0628 ب  ARABIC LETTER BEH
U+062A ت  ARABIC LETTER TEH
U+062B ث  ARABIC LETTER THEH
U+062C ج  ARABIC LETTER JEEM
U+062D ح  ARABIC LETTER HAH
U+062E خ  ARABIC LETTER KHAH
U+062F د  ARABIC LETTER DAL

So you’ll have to extend it to whatever mapping you want.

If you want it in a script instead of a command-line tool, that’s also easy, and then you can refer to the characters by name by setting up a mapping, such as:

 "\N{ARABIC LETTER ALEF}"   =>  "a",
 "\N{ARABIC LETTER BEH}"    =>  "b",
 "\N{ARABIC LETTER TEH}"    =>  "t",
 "\N{ARABIC LETTER THEH}"   =>  "t",
 "\N{ARABIC LETTER JEEM}"   =>  "j",
 "\N{ARABIC LETTER HAH}"    =>  "h",
 "\N{ARABIC LETTER KHAH}"   =>  "h",
 "\N{ARABIC LETTER DAL}"    =>  "d",

If this is supposed to be a component in a larger C++ program, then perhaps you will want to implement this in C++, possibly but not necessarily using the ICU4C library, which includes transliteration support.
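As a rough sketch, ICU’s built-in transforms can be chained from C++ like this (assuming ICU4C is installed and linked against icuuc and icui18n; "Arabic-Latin" and "Latin-ASCII" are standard ICU transform IDs, though the romanization they produce may not match the exact mapping you want):

#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // Chain two transforms: romanize the Arabic, then fold accented Latin to ASCII.
    std::unique_ptr<icu::Transliterator> trans(
        icu::Transliterator::createInstance(
            "Arabic-Latin; Latin-ASCII", UTRANS_FORWARD, status));
    if (U_FAILURE(status)) return 1;
    icu::UnicodeString text =
        icu::UnicodeString::fromUTF8("ابت");  // assumes this source file is saved as UTF-8
    trans->transliterate(text);               // in-place transliteration
    std::string out;
    text.toUTF8String(out);
    std::cout << out << '\n';                 // an ASCII approximation of the input
}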

But if all you need is a simple conversion, I don’t understand why you would write a dedicated C++ program. Seems like way too much work.

tchrist
  • This is getting out of hand... All I want to do is read an Arabic character, "map" it, and then decide what to replace it with from the English character set. For example, I read in (Arabic character) -> G, so I replace that particular Arabic character with G, and so on and so forth. I don't understand why this is so hard – Mike G Feb 17 '12 at 18:59
  • 2
    @MikeG that's where the encoding comes in, and why we were asking you what it was. Reading, for instance, UTF-8 files is a well-defined and well-documented process, and not hard at all! Reading files in an unknown format we can only guess at, now that's hard. – Mr Lister Feb 17 '12 at 19:01
  • @MikeG I think you’re trying too hard. As just one example, `perl -CSAD -Mutf8 -pe 'tr[ابتثجحخد][abttjhhd]' < input > output` is a simple command line that trivially implements a portion of the sort of thing you appear to be talking about. Why are you writing a program when existing tools more than suffice? – tchrist Feb 17 '12 at 19:06
  • @MikeG: This only works if your encoding is UTF8. Is your encoding UTF8? – Mooing Duck Feb 17 '12 at 19:39
  • @MooingDuck This is true, because the standard Unicode encoding is UTF-8: more than 80% of the Web is now in UTF-8. For other encodings, insert a call to `iconv` into the pipeline first, such as `iconv -f UTF-32BE -t UTF-8 < input.utf32be | perl ...`. I make it a **very strict habit** of naming all text files with a suffix that clearly shows their encoding, like `foo.MacRoman`, `foo.latin1`, `foo.cp1252`, `foo.utf8`, or `foo.ascii`. I upgrade anything that isn’t in UTF-8 to be so because an all–UTF-8 workflow is cleanest and easiest to manage, and certainly sanest. Saves countless headaches! – tchrist Feb 17 '12 at 19:48
  • 1
    @tchrist: UTF8 is by far the most widely used, but in this context I'd ask you to refrain from saying Unicode is UTF8; that's confusing. The suffix thing is a great idea. As for UTF8 being cleanest and easiest, UTF32 would like a word with you. Either way, the important thing is you solved the OP's question. – Mooing Duck Feb 17 '12 at 21:11
  • @MooingDuck I agree with you about not saying Unicode when one means UTF-8 or UTF-32 or whatnot. Notice I didn’t claim the web was majority-Unicode: I specified UTF-8. UTF-32 is good for internal processing, UTF-8 for external processing. UTF-16 is the worst of both worlds, and no one in their right mind would ever choose it if not forced into it. I prefer either of UTF-8 or UTF-32 over UTF-16 any day of the week. – tchrist Feb 17 '12 at 21:19
1

You cannot read in the data unless you know the format. Open the file in Microsoft Word, go to "Save As", "Other formats", "Plain Text (.txt)", and save. In the conversion box, select "Other encoding", "Unicode" (which is UTF16LE), and "OK". The file is now saved as UTF16LE.

#include <fstream>
#include <stdexcept>
#include <string>

int main() {
    std::ifstream infile("myfile.txt", std::ios::binary);  // open stream
    infile.seekg(0, std::ios::end);                        // get its size
    int length = (int)infile.tellg();
    infile.seekg(0, std::ios::beg);
    std::wstring filetext(length / 2, L'\0');  // allocate space (assumes 16-bit wchar_t, as on Windows)
    infile.read((char*)&filetext[0], length);  // read entire file
    size_t start = (!filetext.empty() && filetext[0] == 0xFEFF) ? 1 : 0;  // skip the UTF16LE BOM Word writes
    std::string final(filetext.size() - start, '\0');
    for (size_t i = start; i < filetext.size(); ++i) {  // "shift" the values to the "valid" range
        if (filetext[i] >= 0x0600 && filetext[i] <= 0x06FF)
            final[i - start] = (char)(filetext[i] - 0x0600);
        else
            throw std::runtime_error("INVALID CHARACTER");
    }
    // done: final now holds one byte per original character
}

Warnings all over: I highly doubt this will result in what you want, but it's the best that can be managed, since you haven't told us the translation that needs doing or the exact format of the file. Also, I'm assuming your computer and compiler are the same as mine (in particular, that wchar_t is 16 bits). If not, some or all of this might be wrong, but it's the best I can do with the information provided.

Mooing Duck
0

In order to parse out Unicode codepoints, you have to first decode the file into its unencoded Unicode representation (which is equivalent to UTF-32). In order to do that, you first need to know how the file was encoded so it can be decoded. For instance, Unicode codepoints U+0600 and U+06FF are encoded as 0xD8 0x80 and 0xDB 0xBF in UTF-8, as 0x00 0x06 and 0xFF 0x06 in UTF-16LE, as 0x06 0x00 and 0x06 0xFF in UTF-16BE, etc.
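To make the byte layout concrete, here is a tiny sketch that decodes one two-byte UTF-8 sequence by hand:

#include <cstdio>

int main() {
    unsigned char b[2] = {0xD8, 0x80};   // the UTF-8 encoding of U+0600
    // A two-byte UTF-8 sequence is 110xxxxx 10xxxxxx: take 5 bits from
    // the lead byte and 6 bits from the trail byte.
    unsigned cp = ((b[0] & 0x1Fu) << 6) | (b[1] & 0x3Fu);
    std::printf("U+%04X\n", cp);         // prints U+0600
}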

If the file starts with a BOM, then you know the exact encoding used and can interpret the rest of the file accordingly. For instance, the UTF-8 BOM is 0xEF 0xBB 0xBF, UTF-16LE is 0xFF 0xFE, UTF-16BE is 0xFE 0xFF, and so on.
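BOM sniffing amounts to comparing the first few bytes of the file against each signature, longest first. A minimal sketch (detect_bom is a hypothetical helper name):

#include <cstdio>
#include <cstring>
#include <string>

// Report which BOM, if any, a file starts with. The 4-byte UTF-32 signatures
// must be tested before the 2-byte UTF-16 ones, because the UTF-32LE BOM
// begins with the same bytes as the UTF-16LE BOM.
std::string detect_bom(const char* path) {
    unsigned char buf[4] = {0, 0, 0, 0};
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return "unreadable";
    size_t n = std::fread(buf, 1, 4, f);
    std::fclose(f);
    if (n >= 4 && std::memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (n >= 4 && std::memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (n >= 3 && std::memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (n >= 2 && std::memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    if (n >= 2 && std::memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    return "no BOM";
}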

If the file does not start with a BOM, then you have to analyze the data and perform heuristics on it to detect the encoding, but that is not 100% reliable. Although it is fairly easy to detect UTF encodings, it is nearly impossible to detect ANSI encodings with any measure of reliability. Even detecting UTF encodings without a BOM present can produce false results at times.

Don't ever guess; you will risk data loss. If you do not know the exact encoding used, ask the user for it.

Remy Lebeau