Using ICU to implement my own codecvt facet

Question

I want to implement a codecvt facet using ICU to convert from any character encoding (that ICU supports) to UTF-8 internally. I'm aware that codecvt_byname exists and that it can be used to do part of what I want as shown in this example. The problems with that example are that it (1) uses wide character streams (I want to use "regular", byte-oriented streams) and (2) requires 2 streams to perform the conversion. Instead, I want a single stream like:

locale loc( locale(), new icu_codecvt( "ISO-8859-1" ) );
ifstream ifs;
ifs.imbue( loc );
ifs.open( "/path/to/some/file.txt" );
// data read from ifs here will have been converted from ISO-8859-1 to UTF-8

Hence, I want to do an implementation like this but using ICU rather than iconv. Given that, my implementation of do_in() is:

icu_codecvt::result icu_codecvt::do_in( state_type &state,
                                        extern_type const *from, extern_type const *from_end,
                                        extern_type const *&from_next, intern_type *to,
                                        intern_type *to_end, intern_type *&to_next ) const {
  from_next = from;
  to_next = to;
  if ( always_noconv_ )
    return noconv;

  our_state *const s = state_store_.get( state );
  UErrorCode err = U_ZERO_ERROR;
  ucnv_convertEx(
    s->utf8_conv_, s->extern_conv_, &to_next, to_end, &from_next, from_end,
    nullptr, nullptr, nullptr, nullptr, false, false, &err
  );
  if ( err == U_TRUNCATED_CHAR_FOUND )
    return partial;
  return U_SUCCESS( err ) ? ok : error;
}

The our_state object maintains two UConverter* pointers, one for the "external" encoding (in this example, ISO-8859-1) and one for the UTF-8 encoding.

My questions are:

Should I specify nullptr for the "pivot" buffer as above, or supply my own?
I'm not sure when, if ever, I should set the reset argument (currently the first false above) to true.
It's not clear how I would know when to set the flush argument (currently the second false above) to true, i.e., how I know when the end of the input has been reached.

A little help?

You should imbue() your file stream before opening the file. A lot of systems will silently ignore the imbue() if the file is already open (this is because state about the conversion may have been lost). – Martin York, Dec 30 '11 at 15:54

score 0 · Answer 1 · answered Dec 30 '11 at 17:40


The codecvt facet is not intended to convert between different encodings. Instead, it converts from an external encoding, where one character is possibly encoded using multiple external words (typically bytes), into an internal representation where each character is represented by exactly one word (e.g. char, wchar_t, char16_t, etc.).

From this perspective it doesn't make sense to "end" an internal character sequence. If there are no more external words available the conversion is done and if the last character remained incomplete this is an error in the transfer. Thus, there is no need to indicate that the conversion is complete and, correspondingly, no interface. This should clarify that the "flush" argument indeed should always be "false".
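To make that contract concrete, here is a minimal sketch using only the standard library (no ICU). The class utf8_to_wide and the helper decode_utf8 are hypothetical names invented for this illustration, and the facet only handles 1- to 3-byte UTF-8 sequences. It shows several external bytes collapsing into one internal wchar_t, and in() returning partial when the input stops in the middle of a character; it is then up to the caller to decide that no more bytes are coming.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical facet for illustration: UTF-8 bytes outside, one wchar_t per
// code point inside. Only 1- to 3-byte sequences (the BMP) are handled.
class utf8_to_wide : public std::codecvt<wchar_t, char, std::mbstate_t> {
  typedef std::codecvt<wchar_t, char, std::mbstate_t> base;

public:
  explicit utf8_to_wide(std::size_t refs = 0) : base(refs) {}

protected:
  result do_in(state_type&, const extern_type* from,
               const extern_type* from_end, const extern_type*& from_next,
               intern_type* to, intern_type* to_end,
               intern_type*& to_next) const override {
    from_next = from;
    to_next = to;
    while (from_next != from_end) {
      if (to_next == to_end) return partial;  // output buffer is full
      const unsigned char lead = static_cast<unsigned char>(*from_next);
      int len;
      unsigned long cp;
      if (lead < 0x80)                { len = 1; cp = lead; }
      else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
      else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
      else return error;              // unsupported or invalid lead byte
      if (from_end - from_next < len) return partial;  // incomplete character
      for (int i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(from_next[i]);
        if ((c & 0xC0) != 0x80) return error;  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
      }
      *to_next++ = static_cast<wchar_t>(cp);
      from_next += len;
    }
    return ok;
  }
};

// Hypothetical helper: runs the facet once over `bytes`, stores the decoded
// text in `out` and the number of external bytes consumed in `consumed`.
std::codecvt_base::result decode_utf8(const std::string& bytes,
                                      std::wstring& out,
                                      std::size_t& consumed) {
  utf8_to_wide cvt;
  std::mbstate_t state = std::mbstate_t();
  out.assign(bytes.size(), L'\0');  // never more output words than input bytes
  const char* from = bytes.data();
  const char* from_next = from;
  wchar_t* to = &out[0];
  wchar_t* to_next = to;
  const std::codecvt_base::result r = cvt.in(
      state, from, from + bytes.size(), from_next, to, to + out.size(), to_next);
  out.resize(static_cast<std::size_t>(to_next - to));
  consumed = static_cast<std::size_t>(from_next - from);
  return r;
}
```

Calling decode_utf8 on the bytes 'h', 0xC3, 0xA9 yields two wide characters, whereas calling it on 'h', 0xC3 alone consumes only the 'h' and reports partial. Nothing in the in() signature lets the caller say "this was the last chunk", which is the missing "flush" the answer refers to.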

I realize that UTF-8 doesn't quite fit the bill of having one word represent one character. However, this will haunt your entire UTF-8 processing when using standard types for processing strings. As long as you stay clear of string modifications things typically work OK, though.

The "reset" parameter is probably intended to deal with seeking within a stream. I think filebuf is supposed to provide a fresh state_type object when seeking. This would probably be an indication that the ICU internals want to be reset. However, I don't know about the ICU interface. Thus, I also don't know if you want to supply a pivot buffer.

answered Dec 30 '11 at 17:40

Dietmar Kühl


My original idea was to have a transcoding streambuffer, but this guy http://stackoverflow.com/a/8453807/99089 said to use a codecvt -- so who's right? Or, how would you implement automatic conversion from an arbitrary encoding to UTF-8 using iostreams in an "elegant" way? – Paul J. Lucas Dec 30 '11 at 18:09
`char16_t` and `wchar_t` are *not* guaranteed to be one word per character. `char16_t` is specifically for UTF-16 code units, which have surrogate pairs. So no, it's not one 16-bit word per "character". – Nicol Bolas Dec 30 '11 at 19:27
@Nicol Bolas: actually wchar_t was intended to be one word per character (but the Unicode committee dropped their stated goal of creating a 16 bit encoding for all characters shortly after Java and Windows decided to use 16 bits for wchar_t). You are conceptually right about char16_t, but the stream and string classes still assume that one word is one character. This is e.g. reflected in the codecvt interface. – Dietmar Kühl Dec 30 '11 at 19:51
@DietmarKühl: I think it *is* what I'm trying to do. It's just that my "internal representation" happens to be UTF-8. – Paul J. Lucas Dec 30 '11 at 20:19
@DietmarKühl: so what *is* the prescribed way to do what I want? – Paul J. Lucas Dec 30 '11 at 20:47
@DietmarKühl: so I got my implementation working only to discover that, apparently, only streams that use filebufs route characters through a codecvt facet -- in particular, cout, cin, and all stringstreams do not. That seems dumb to not use the facet universally and severely limits the usefulness of such a facet. Perhaps I should go back to my original streambuf idea. – Paul J. Lucas Dec 31 '11 at 22:51
@DietmarKühl: then I suppose what I want is a bytestream rather than a stringstream. I want to transform bytes in one form to another. I think I may have to go back to my original idea of using my own streambuffer (rather than a facet) to do the transcoding. – Paul J. Lucas Jan 02 '12 at 18:37
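
For readers wondering what the streambuffer approach mentioned in these comments might look like, here is a rough sketch using only the standard library. The names latin1_to_utf8_buf and read_all_as_utf8 are hypothetical, and the ISO-8859-1 to UTF-8 step is hard-coded by hand purely to keep the example self-contained. A real version would presumably call into ICU at that point instead, which is an assumption and not something shown in the thread. Because the conversion lives in a streambuf rather than a facet, it works on top of any source buffer, including a stringbuf.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filtering buffer: pulls ISO-8859-1 bytes from `src` and
// exposes UTF-8 bytes to whatever istream is layered on top of it.
class latin1_to_utf8_buf : public std::streambuf {
public:
  explicit latin1_to_utf8_buf(std::streambuf* src) : src_(src) {}

protected:
  int_type underflow() override {
    const int_type c = src_->sbumpc();  // fetch one external byte
    if (traits_type::eq_int_type(c, traits_type::eof()))
      return traits_type::eof();
    const unsigned char b =
        static_cast<unsigned char>(traits_type::to_char_type(c));
    std::size_t n = 0;
    if (b < 0x80) {
      buf_[n++] = static_cast<char>(b);  // ASCII passes through unchanged
    } else {
      buf_[n++] = static_cast<char>(0xC0 | (b >> 6));    // UTF-8 lead byte
      buf_[n++] = static_cast<char>(0x80 | (b & 0x3F));  // continuation byte
    }
    setg(buf_, buf_, buf_ + n);  // publish the converted bytes as get area
    return traits_type::to_int_type(buf_[0]);
  }

private:
  std::streambuf* src_;  // not owned
  char buf_[2];          // one Latin-1 byte becomes at most two UTF-8 bytes
};

// Hypothetical helper: drains `latin1_source` through the filter.
std::string read_all_as_utf8(std::streambuf* latin1_source) {
  latin1_to_utf8_buf filter(latin1_source);
  std::istream in(&filter);
  std::istreambuf_iterator<char> first(in), last;
  return std::string(first, last);
}
```

For example, wrapping the rdbuf() of an istringstream that holds the Latin-1 bytes 'c', 'a', 'f', 0xE9 produces the UTF-8 bytes 'c', 'a', 'f', 0xC3, 0xA9. Converting one byte per underflow() call is slow and is only done here for brevity; a practical filter would convert a block at a time.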
