Transcoding characters on-the-fly using iostreams and ICU

Question

I'd like to transcode character encoding on-the-fly. I'd like to use iostreams and my own transcoding streambuf, e.g.:

xcoder_streambuf xbuf( "UTF-8", "ISO-8859-1", cout.rdbuf() );
cout.rdbuf( &xbuf );

char *utf8_s;    // pointer to buffer containing UTF-8 encoded characters
// ...
cout << utf8_s;  // characters are written in ISO-8859-1

The implementation of xcoder_streambuf would use ICU's converters API. It would take the data coming in (in this case, from utf8_s), transcode it, and write it out using the iostream's original streambuf.

Is that a reasonable way to go? If not, what would be better?

score 0 · Answer 1 · answered Dec 10 '11 at 01:57

Is that a reasonable way to go?

Yes, but it is not the way you are expected to do it with modern (as in 1997) iostreams.

The behaviour of outputting through basic_streambuf<> is defined by the overflow(int_type c) virtual function.
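To make the overflow() mechanism concrete, here is a minimal sketch of a filtering streambuf along the lines the question describes. It is only an illustration under stated assumptions: the class name utf8_to_latin1_streambuf and the helper utf8_to_latin1 are hypothetical, and a small hand-written UTF-8 decoder stands in for ICU's converter API so that the example stays self-contained. It does not reject overlong encodings and it drops the byte that interrupts a truncated sequence.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filtering streambuf: accepts UTF-8 bytes and forwards
// ISO-8859-1 bytes to `dest`. The hand-written decoder below is a stand-in
// for a real converter (the question would use ICU here).
class utf8_to_latin1_streambuf : public std::streambuf {
public:
    explicit utf8_to_latin1_streambuf(std::streambuf* dest)
        : dest_(dest), cp_(0), pending_(0) {
        assert(dest_ != nullptr);
    }

protected:
    // No put area is set up, so every character written ends up here.
    int_type overflow(int_type c) override {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);
        const unsigned char byte =
            static_cast<unsigned char>(traits_type::to_char_type(c));
        if (pending_ == 0) {
            if (byte < 0x80) return emit(byte);                     // ASCII
            if ((byte & 0xE0) == 0xC0)      { cp_ = byte & 0x1F; pending_ = 1; }
            else if ((byte & 0xF0) == 0xE0) { cp_ = byte & 0x0F; pending_ = 2; }
            else if ((byte & 0xF8) == 0xF0) { cp_ = byte & 0x07; pending_ = 3; }
            else return emit('?');          // stray continuation or bad lead byte
            return traits_type::not_eof(c); // lead byte stored, nothing emitted yet
        }
        if ((byte & 0xC0) != 0x80) {        // sequence was cut short
            pending_ = 0;
            return emit('?');
        }
        cp_ = (cp_ << 6) | (byte & 0x3F);
        if (--pending_ > 0) return traits_type::not_eof(c);
        return emit(cp_ < 0x100 ? cp_ : '?'); // not representable in Latin-1
    }

    int sync() override { return dest_->pubsync(); }

private:
    // Forward one transcoded byte to the wrapped streambuf.
    int_type emit(unsigned long cp) {
        return dest_->sputc(static_cast<char>(static_cast<unsigned char>(cp)));
    }

    std::streambuf* dest_;
    unsigned long cp_;   // code point being assembled
    int pending_;        // continuation bytes still expected
};

// Hypothetical helper that runs a whole string through the filter.
std::string utf8_to_latin1(const std::string& utf8) {
    std::ostringstream sink;
    utf8_to_latin1_streambuf xbuf(sink.rdbuf());
    std::ostream out(&xbuf);
    out << utf8;
    out.flush();
    return sink.str();
}
```

With cout, the same filter would be installed as in the question: construct it around cout.rdbuf(), then pass its address to cout.rdbuf(), and restore the original buffer before the filter goes out of scope.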

The description of basic_filebuf<>::overflow(int_type c = traits::eof()) includes a_codecvt.out(state, b, p, end, xbuf, xbuf+XSIZE, xbuf_end); where a_codecvt is defined as:

const codecvt<charT,char,typename traits::state_type>& a_codecvt 
     = use_facet<codecvt<charT,char,typename traits::state_type> >(getloc());

so you are expected to imbue a locale with the appropriate codecvt<charT,char,typename traits::state_type> converter.
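As a hedged illustration of the imbue approach, the sketch below defines a codecvt facet that writes wide characters as ISO-8859-1 bytes and installs it on a wofstream. The names latin1_codecvt and write_as_latin1 are hypothetical, the wide characters are assumed to hold Unicode code points, and anything above U+00FF is replaced with '?'. A file stream is used because basic_filebuf is the standard stream buffer that consults the codecvt facet.

```cpp
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical facet: wchar_t code points <-> ISO-8859-1 bytes.
class latin1_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        for (from_next = from, to_next = to;
             from_next != from_end && to_next != to_end;
             ++from_next, ++to_next) {
            const unsigned long cp = static_cast<unsigned long>(*from_next);
            // Code points outside Latin-1 become '?'.
            *to_next = static_cast<char>(
                static_cast<unsigned char>(cp < 0x100 ? cp : '?'));
        }
        return from_next == from_end ? ok : partial;
    }

    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        for (from_next = from, to_next = to;
             from_next != from_end && to_next != to_end;
             ++from_next, ++to_next) {
            // Every Latin-1 byte maps to the code point with the same value.
            *to_next = static_cast<wchar_t>(static_cast<unsigned char>(*from_next));
        }
        return from_next == from_end ? ok : partial;
    }

    // The encoding is stateless, so there is no shift sequence to write.
    result do_unshift(std::mbstate_t&, char* to, char*,
                      char*& to_next) const override {
        to_next = to;
        return noconv;
    }

    int do_encoding() const noexcept override { return 1; }   // 1 byte per char
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }

    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override {
        const std::size_t avail = static_cast<std::size_t>(from_end - from);
        return static_cast<int>(avail < max ? avail : max);
    }
};

// Hypothetical helper: writes `text` to `path` through the facet and returns
// the raw bytes that ended up in the file.
std::string write_as_latin1(const std::wstring& text, const char* path) {
    {
        std::wofstream out;
        // Imbue before opening so the filebuf uses the facet from the start.
        out.imbue(std::locale(std::locale::classic(), new latin1_codecvt));
        out.open(path, std::ios::binary);
        out << text;
    }   // closing the stream flushes the converted bytes
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The locale takes ownership of the facet created with new, so no explicit delete is needed. Note that the program-side character type here is wchar_t, not UTF-8 encoded char, which is one of the ways this approach differs from the streambuf in the question.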

The class codecvt<internT,externT,stateT> is for use when converting from one character encoding to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.

The standard library's support for Unicode has made some progress since 1997:

the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.

This seems to be what you want (ISO-8859-1 codes are the first 256 UCS-4 codes, and UCS-4 = UTF-32).
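A rough sketch of how that specialization could be used directly follows. It assumes a C++11 library (the specialization was later deprecated in C++20), and the names utf8_utf32_codecvt and utf8_to_latin1_via_utf32 are hypothetical. It also shows that a narrowing step is still needed after decoding: UTF-32 code units are four bytes wide, so only values below 0x100 can be written out as single ISO-8859-1 bytes.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// codecvt has a protected destructor; this thin wrapper (hypothetical name)
// lets the facet be used as an ordinary local object.
struct utf8_utf32_codecvt : std::codecvt<char32_t, char, std::mbstate_t> {
    ~utf8_utf32_codecvt() {}
};

// Hypothetical helper: decode UTF-8 to UTF-32 with the standard facet, then
// narrow each code point to ISO-8859-1, substituting '?' where it does not fit.
std::string utf8_to_latin1_via_utf32(const std::string& utf8) {
    utf8_utf32_codecvt cvt;
    std::mbstate_t state = std::mbstate_t();

    // One code point per input byte is a safe upper bound.
    std::u32string wide(utf8.size(), U'\0');
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;
    const std::codecvt_base::result r =
        cvt.in(state, utf8.data(), utf8.data() + utf8.size(), from_next,
               &wide[0], &wide[0] + wide.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid or incomplete UTF-8 input");
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));

    std::string latin1;
    latin1.reserve(wide.size());
    for (char32_t cp : wide) {
        latin1 += static_cast<char>(
            static_cast<unsigned char>(cp < 0x100 ? cp : U'?'));
    }
    return latin1;
}
```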

If not, what would be better?

I would introduce a different type for UTF8, like:

struct utf8 {
    unsigned char d; // d for data
};

struct latin1 {
    unsigned char c; // c for character 
};

This way you cannot accidentally pass UTF8 where ISO-8859-* is expected. But then you would have to write some interface code, and the type of your streams won't be istream/ostream.
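The kind of interface code this implies might look like the sketch below. It is only an illustration: as_latin1 and raw_bytes are hypothetical names, and std::vector is used as the container because std::basic_string would need a char_traits specialization for the new types.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

struct utf8   { unsigned char d; };  // one UTF-8 code unit
struct latin1 { unsigned char c; };  // one ISO-8859-1 character

// The two types do not convert into each other implicitly, so mixing them up
// is a compile-time error instead of a silent encoding bug.
static_assert(!std::is_convertible<utf8, latin1>::value,
              "utf8 must not convert to latin1 implicitly");
static_assert(!std::is_convertible<latin1, utf8>::value,
              "latin1 must not convert to utf8 implicitly");

// Hypothetical boundary function: label raw bytes as ISO-8859-1 text.
std::vector<latin1> as_latin1(const std::string& raw) {
    std::vector<latin1> out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        latin1 unit = { static_cast<unsigned char>(raw[i]) };
        out.push_back(unit);
    }
    return out;
}

// Hypothetical boundary function: hand typed text back to byte-oriented code,
// for example to write it to an ordinary ostream.
std::string raw_bytes(const std::vector<latin1>& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
        out += static_cast<char>(text[i].c);
    return out;
}
```

Passing a std::vector<utf8> to raw_bytes would fail to compile, which is the protection the wrapper types are meant to provide.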

Disclaimer: I never actually did such a thing, so I don't know if it is workable in practice.

This guy disagrees about using codecvt: http://stackoverflow.com/a/8682250/99089 -- so who's right? — Paul J. Lucas, Dec 30 '11 at 18:08
