
I am trying to write a function that converts a Shift-JIS string to a UTF-8 string using iconv. I wrote the code below:

#include <iconv.h>
#include <iostream>
#include <stdio.h>
using namespace std;

#define BUF_SIZE 1024
size_t z = (size_t) BUF_SIZE-1;

bool sjis2utf8( char* text_sjis, char* text_utf8 )
{
  iconv_t ic;
  ic = iconv_open("UTF8", "SJIS"); // sjis->utf8
  iconv(ic , &text_sjis, &z, &text_utf8, &z);
  iconv_close(ic);
  return true;
}
int main(void)
{
  char hello[BUF_SIZE] = "hello";
  char bye[BUF_SIZE] = "bye";
  char tmp[BUF_SIZE] = "something else";

  sjis2utf8(hello, tmp);
  cout << tmp << endl;

  sjis2utf8(bye, tmp);
  cout << tmp << endl;
}

I expected the output to be

hello
bye

but the actual output is

hello
hello

Does anyone know why this happens? What's wrong with my program?

Note that "hello" and "bye" are Japanese Shift-JIS strings in my original program; I replaced them here to make the program easier to read.

fbessho
  • Please note that in your example z is decremented twice for every converted character. – joy Nov 12 '11 at 11:48
  • also you should not cout utf8, you should cout ascii and wcout UCS2 on windows and UCS4 on Linux. – joy Nov 12 '11 at 11:51

4 Answers


I think you are misusing the iconv function by passing it the global variable z for both length arguments. The first time you call sjis2utf8, z is decremented to 0. The second call to sjis2utf8 then has no effect (z == 0) and leaves tmp unchanged.

From the iconv documentation :

size_t iconv (iconv_t cd,
              const char* * inbuf, size_t * inbytesleft,
              char* * outbuf, size_t * outbytesleft);

The iconv function converts one multibyte character at a time, and for each character conversion it increments *inbuf and decrements *inbytesleft by the number of converted input bytes, it increments *outbuf and decrements *outbytesleft by the number of converted output bytes, and it updates the conversion state contained in cd.

You should use two separate variables for the buffer lengths:

size_t il = BUF_SIZE - 1 ;
size_t ol = BUF_SIZE - 1 ;

iconv(ic, &text_sjis, &il, &text_utf8, &ol) ;

Then check the return value of iconv and the remaining buffer lengths to confirm that the conversion succeeded.
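As a minimal sketch of that advice, the function below keeps both length counters local, so they start fresh on every call, and checks the results of iconv_open and iconv. The name convert_string, the std::string interface, and the 1024-byte output buffer are assumptions made for illustration, not part of the original code. It also assumes a stateless source encoding such as Shift-JIS.

```cpp
#include <iconv.h>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): converts `text`
// from encoding `from` to encoding `to`.
// Returns true and fills `out` on success, false on any failure.
bool convert_string(const char* to, const char* from,
                    const std::string& text, std::string& out)
{
    assert(to != nullptr && from != nullptr);

    iconv_t ic = iconv_open(to, from);
    if (ic == (iconv_t)-1) return false;   // conversion not supported

    std::string in(text);                  // iconv needs a mutable char*
    char out_buf[1024];                    // assumed large enough here
    char* in_ptr = &in[0];
    char* out_ptr = out_buf;
    size_t in_left = in.size();            // local: reset on every call
    size_t out_left = sizeof(out_buf);     // separate counter for output

    size_t rc = iconv(ic, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(ic);
    if (rc == (size_t)-1) return false;    // invalid input or buffer too small

    // Bytes written = capacity minus what iconv reports as remaining.
    out.assign(out_buf, sizeof(out_buf) - out_left);
    return true;
}
```

With this shape, calling convert_string("UTF-8", "SJIS", hello, result) and then convert_string("UTF-8", "SJIS", bye, result) should not interfere with each other, because no counter survives between calls.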

overcoder
  • after first conversion z will overflow. In c/c++ it will wrap around if size_t is unsigned. If size_t is signed it will cause undefined behaviour. – joy Nov 12 '11 at 12:02
  • also in your second statement... to use 2 variables iconv will convert BUF_SIZE -1 characters and not only the string contents... in his example this is obvious because the null terminator will be converted also to 0 and cout will stop at that moment. – joy Nov 12 '11 at 12:06
  • It depends on the `iconv` internal checks implementation. As you said in a previous comment : *z is decremented twice for every converted character*, but we don't know how it checks and stops when reaching 0. It can overflow (yes, size_t is unsigned). – overcoder Nov 12 '11 at 12:08
#include <iconv.h>
#include <iostream>
#include <stdio.h>
#include <string.h>

using namespace std;

const size_t BUF_SIZE=1024;


class IConv {
    iconv_t ic_;
public:
    IConv(const char* to, const char* from) 
        : ic_(iconv_open(to,from))    { }
    ~IConv() { iconv_close(ic_); }

     // Returns true on success. out_size is updated to the bytes left in output.
     bool convert(char* input, char* output, size_t& out_size) {
        // Include the terminator so the output is null-terminated too.
        // This assumes the s-jis input is null-terminated and has no
        // embedded null bytes; otherwise the input length must be
        // supplied some other way.
        size_t inbufsize = strlen(input)+1;
        // iconv returns (size_t)-1 on error, so compare explicitly.
        return iconv(ic_, &input, &inbufsize, &output, &out_size) != (size_t)-1;
     }
};

int main(void)
{
    char hello[BUF_SIZE] = "hello";
    char bye[BUF_SIZE] = "bye";
    char tmp[BUF_SIZE] = "something else";
    IConv ic("UTF8","SJIS");

    size_t outsize = BUF_SIZE;//you will need it
    ic.convert(hello, tmp, outsize);
    cout << tmp << endl;

    outsize = BUF_SIZE;
    ic.convert(bye, tmp, outsize);
    cout << tmp << endl;
}
  • based on Kleist's answer
Brock Adams
joy

You must pass the length of the input string as the third parameter of iconv.

Try:

//...
size_t len = strlen(text_sjis); // iconv expects a size_t*, not an int*
iconv(ic , &text_sjis, &len, &text_utf8, &z);
//...
masoud
size_t iconv (iconv_t cd,
          const char* * inbuf, size_t * inbytesleft,
          char* * outbuf, size_t * outbytesleft);

iconv changes the value pointed to by inbytesleft, so after your first run z is 0. To fix this, calculate the length of inbuf and store it in a local variable before each conversion.

It is described here: http://www.gnu.org/s/libiconv/documentation/libiconv/iconv.3.html
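To make the counter updates visible, here is a small sketch that runs one conversion and reports what iconv left in each counter. The Remaining struct and the convert_once function are hypothetical names introduced only for this illustration.

```cpp
#include <iconv.h>
#include <cassert>
#include <cstddef>

// Hypothetical struct: the state of both counters after one iconv call.
struct Remaining {
    size_t in_left;   // input bytes not yet consumed
    size_t out_left;  // output bytes still free
    bool ok;          // false if iconv reported an error
};

// Runs a single iconv call with local counters and returns their final values.
Remaining convert_once(iconv_t cd, char* in, size_t in_len,
                       char* out, size_t out_cap)
{
    assert(in != nullptr && out != nullptr);
    size_t in_left = in_len;
    size_t out_left = out_cap;
    // iconv advances `in` and `out` and decrements both counters.
    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    return Remaining{in_left, out_left, rc != (size_t)-1};
}
```

For a 5-byte ASCII input and a 16-byte output buffer, a successful call should leave in_left at 0 and out_left at 11, which is why a counter shared between calls ends up exhausted.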

And since you tagged this as C++, I would suggest wrapping everything up in a nice little class. As far as I can tell from the documentation, you can reuse the iconv_t obtained from iconv_open for as many conversions as you'd like.

#include <iconv.h>
#include <iostream>
#include <stdio.h>
#include <string.h>

using namespace std;

const size_t BUF_SIZE = 1024;
size_t z = (size_t) BUF_SIZE-1;

class IConv {
    iconv_t ic_;
public:
    IConv(const char* to, const char* from) 
        : ic_(iconv_open(to,from))    { }

    ~IConv() { iconv_close(ic_); }

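    // Returns true on success; output is null-terminated.
    bool convert(char* input, char* output, size_t outbufsize) {
        if (outbufsize == 0) return false;
        size_t inbufsize = strlen(input);
        --outbufsize;  // reserve one byte for the terminator
        size_t rc = iconv(ic_, &input, &inbufsize, &output, &outbufsize);
        *output = '\0';  // iconv does not terminate the output itself
        return rc != (size_t)-1;  // (size_t)-1 signals an error
    }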
};

int main(void)
{
    char hello[BUF_SIZE] = "hello";
    char bye[BUF_SIZE] = "bye";
    char tmp[BUF_SIZE] = "something else";
    IConv ic("UTF8","SJIS");


    ic.convert(hello, tmp, BUF_SIZE);
    cout << tmp << endl;

    ic.convert(bye, tmp, BUF_SIZE);
    cout << tmp << endl;
}
Kleist
  • I think your answer is flawed because it will always convert 1024 characters/byte sequence. – joy Nov 12 '11 at 12:12
  • Thanks a lot. Your code works well only after I changed `const size_t BUF_SIZE` to `#define BUF_SIZE`, and it's nice because actually I have to convert from sjis string to utf8 string in other places. – fbessho Nov 12 '11 at 12:24
  • @neagoegab: Changed to use strlen and passing outbufsize as argument. I see no need to make a local variable and pass a pointer to it as you've done in your copy. – Kleist Nov 12 '11 at 18:40
  • BUF_SIZE is a constant... it cannot be modified. iconv does write to outbufsize. Also, he will need the size of the converted text for further manipulation. – joy Nov 13 '11 at 00:48
  • @neagoegab My code doesn't modify BUF_SIZE, since it is passed by value. Who says he needs the length of the converted text? Many APIs simply expect null-terminated C-strings. – Kleist Nov 14 '11 at 01:08
  • Oops, you are right about BUF_SIZE. My mistake. +one beer from me. About the c string, there are encodings where 0(zero) is a bad byte and null character is encoded as some other value. Libicu, iconv are working on byte sequences. – joy Nov 14 '11 at 08:38