5

I'm changing a C++ program, which processes text in ISO Latin 1 format, to store its data in an SQLite database.
The problem is that SQLite works in UTF-8... and the Java modules that use the same database also work in UTF-8.

I want a way to convert the ISO Latin 1 characters to UTF-8 before storing them in the database. It needs to work on Windows and Mac.

I heard ICU would do that, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these 2 charsets.

How would I do that?

hippietrail
gabriel
  • 2
    Are you using Windows Latin-1 or true ISO Latin 1? –  Apr 07 '11 at 19:10
  • I would have suggested using GLib's wrapper for iconv, which converts easily between any 2 charsets, but if you are sure that you need only latin1->utf8, then @Evan 's solution below is the simplest. Either way, ICU seems way too big for this. – davka Apr 07 '11 at 19:59

4 Answers

17

ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.

for each char:

uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */

if(ch < 0x80) {
    append(ch);
} else {
    append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
    append(0x80 | (ch & 0x3f));
}

See http://en.wikipedia.org/wiki/UTF-8#Description for more details.

EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
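As a worked instance of the two-byte branch, here is a minimal sketch that encodes a single Latin-1 byte, with the arithmetic for Ç (0xC7) spelled out in the comments. The function name `latin1_char_to_utf8` is made up for illustration, and the sketch assumes true ISO-8859-1 input, not CP1252.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one ISO-8859-1 byte as a UTF-8 sequence.
std::string latin1_char_to_utf8(std::uint8_t ch)
{
    std::string out;
    if (ch < 0x80) {
        // ASCII range: the byte is already valid UTF-8.
        out.push_back(static_cast<char>(ch));
    } else {
        // Example: 0xC7 = 1100 0111.
        // The top two bits (11) go into the lead byte: 0xC0 | 0x03 = 0xC3.
        out.push_back(static_cast<char>(0xc0 | (ch >> 6)));
        // The low six bits (00 0111) go into the continuation byte: 0x80 | 0x07 = 0x87.
        out.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
    }
    return out;
}
```

So Ç becomes the two bytes 0xC3 0x87, which matches the two-byte row of the Wikipedia table for code points U+0080 to U+07FF.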

Community
Evan Teran
  • 2
    As I said, if it's **real** Latin1. Windows CP1252 (sometimes incorrectly called Latin1) has additional characters (in a range reserved in ISO-8859 for control characters), most notably, versions of opening and closing quotes. – ninjalj Apr 07 '11 at 19:55
  • 2
    Oh, and there's no below on SO ;-P – ninjalj Apr 07 '11 at 19:56
  • 2
    `(ch & 0xc0) >> 6` is redundant. You can just write `ch >> 6`. – dan04 Apr 08 '11 at 12:48
  • @dan04: can't ever hurt to be explicit. – Evan Teran Apr 08 '11 at 15:04
  • I really can't understand the table in the Wikipedia link. If I have Latin-1 Ç, that falls in the range that needs up to 11 bits, but how does the formula above work? – spakai Jun 07 '17 at 10:12
  • OK, this demo sheds some light: http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/The-Basics-of-UTF8.htm – spakai Jun 07 '17 at 10:24
2

For C++ I use this:

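#include <cstdint>
#include <string>

// Converts an ISO-8859-1 string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        // Read the byte as unsigned so values >= 0x80 compare correctly.
        uint8_t ch = static_cast<uint8_t>(*it);
        if (ch < 0x80) {
            // ASCII range: one byte, unchanged.
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            // 0x80-0xFF: two bytes, with a lead byte of 0xC2 or 0xC3.
            strOut.push_back(static_cast<char>(0xc0 | ch >> 6));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}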
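
The question also asks for the way back. This is only a sketch of how the reverse direction might look, not something from the original answer: the name `utf8_to_iso_8859_1` and the choice to throw `std::invalid_argument` on anything outside Latin-1 are my assumptions (substituting '?' would be another option).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical reverse helper: UTF-8 to ISO-8859-1.
// Only code points U+0000..U+00FF can be represented, so the only
// multi-byte lead bytes accepted are 0xC2 and 0xC3.
std::string utf8_to_iso_8859_1(const std::string &str)
{
    std::string out;
    out.reserve(str.size());
    for (std::size_t i = 0; i < str.size(); ++i) {
        std::uint8_t ch = static_cast<std::uint8_t>(str[i]);
        if (ch < 0x80) {
            // ASCII: copy unchanged.
            out.push_back(static_cast<char>(ch));
        } else if ((ch == 0xc2 || ch == 0xc3) && i + 1 < str.size() &&
                   (static_cast<std::uint8_t>(str[i + 1]) & 0xc0) == 0x80) {
            // Two-byte sequence: rebuild the 8-bit value from 2 + 6 bits.
            std::uint8_t next = static_cast<std::uint8_t>(str[++i]);
            out.push_back(static_cast<char>(
                static_cast<std::uint8_t>(((ch & 0x03) << 6) | (next & 0x3f))));
        } else {
            // Malformed UTF-8 or a character outside Latin-1 (assumed policy).
            throw std::invalid_argument("input is not representable in ISO-8859-1");
        }
    }
    return out;
}
```

Round-tripping a Latin-1 string through the forward conversion and then this function should give back the original bytes.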
Lord Raiden
  • This solution does seem to work for me on Unix systems but somehow does not seem to work on Windows with Visual Studio. Does anyone have any ideas? – MaestroMaus Jan 10 '18 at 18:46
1

If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.

Compose a static translation table (char to UTF-8 sequence) and put together your own translation. Depending on what you use for string storage (char buffers, std::string, or something else), it would look somewhat different, but the idea is the same: walk through the source string and replace each character whose code is over 127 with its UTF-8 counterpart string. Since this can increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, pass two performs the translation.
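
A rough sketch of the table idea, assuming `std::string` storage and true ISO-8859-1 input. The names `Utf8Seq`, `make_latin1_table` and `latin1_to_utf8_table` are invented for this example. Here the table is generated in a loop, which only works because Latin-1 maps directly onto the first 256 code points; for another code page the entries would have to be written out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// One table entry: the UTF-8 bytes for a single source byte and their count.
struct Utf8Seq {
    char bytes[2];
    std::uint8_t len;
};

// Build the 256-entry table (hypothetical helper).
std::array<Utf8Seq, 256> make_latin1_table()
{
    std::array<Utf8Seq, 256> table{};
    for (unsigned c = 0; c < 256; ++c) {
        if (c < 0x80) {
            table[c] = {{static_cast<char>(c), 0}, 1};
        } else {
            table[c] = {{static_cast<char>(0xc0 | (c >> 6)),
                         static_cast<char>(0x80 | (c & 0x3f))}, 2};
        }
    }
    return table;
}

std::string latin1_to_utf8_table(const std::string &src)
{
    static const std::array<Utf8Seq, 256> table = make_latin1_table();

    // Pass one: determine the target size.
    std::size_t size = 0;
    for (char c : src)
        size += table[static_cast<std::uint8_t>(c)].len;

    // Pass two: perform the translation into a buffer reserved up front.
    std::string out;
    out.reserve(size);
    for (char c : src) {
        const Utf8Seq &seq = table[static_cast<std::uint8_t>(c)];
        out.append(seq.bytes, seq.len);
    }
    return out;
}
```

The first pass only saves reallocations; with `std::string` it could be dropped at the cost of the string growing as it is filled.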

Seva Alekseyev
  • 2
    If it's real Latin1, the translation table is trivial, Latin1 maps directly to the first 256 Unicode codepoints. – ninjalj Apr 07 '11 at 19:43
  • @ninjalj, this answer doesn't propose translating to codepoints but to UTF-8 sequences. Each sequence will be either one or two bytes. – Mark Ransom Apr 07 '11 at 19:48
  • @Mark Ransom: it's the same, it's trivial to generate the table without having to look at loads of character tables. – ninjalj Apr 07 '11 at 19:51
  • @Mark: which, incidentally, you would have to do to translate from/to CP1252 – ninjalj Apr 07 '11 at 19:52
0

If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.

In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function) but you may get a warning.

Of course, it needs to be real ISO Latin 1, not Windows CP1252.
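
The widening step could look something like the sketch below. The helper name `latin1_to_utf16` is made up, and the UTF8-CPP call is shown only as a comment because the library is not bundled here; check its documentation before relying on it.

```cpp
#include <string>

// Hypothetical helper: widen each ISO-8859-1 byte to one UTF-16 code unit.
std::u16string latin1_to_utf16(const std::string &src)
{
    std::u16string out;
    out.reserve(src.size());
    for (char c : src) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended
        // on platforms where plain char is signed.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    }
    return out;
}

// With UTF8-CPP available, the result could then be converted along these lines:
//   std::u16string wide = latin1_to_utf16(input);
//   std::string utf8;
//   utf8::utf16to8(wide.begin(), wide.end(), std::back_inserter(utf8));
```

The explicit cast through `unsigned char` is the part that matters: feeding signed `char` values straight into a 16-bit conversion would turn 0xC7 into 0xFFC7 on many compilers.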

Nemanja Trifunovic