How to convert raw MBCS strings (SHIFT-JIS) from Windows to UTF-8 on Linux

Question

I'm writing a program on Linux that has to interface with an existing Windows program. I cannot modify the way the Windows program works, but I must integrate with the existing data. This program will receive raw data structures over a TCP network socket. Unfortunately, the Windows program embeds raw multibyte character strings in the data structures and does not indicate which code page is in use. This works OK for English, but fails miserably with non-Latin languages (e.g. Japanese). At best, I can guess at the code page Windows is using. If my locale is set to "ja" or "ja_JP", I'll have to assume the Windows machine is using the "SHIFT-JIS" code page... Ugly, but that's life.

QUESTION:

Assuming I've guessed the code page correctly, how can I convert these raw MBCS character strings to UTF-8 strings?

Here is a sample of the raw data:

The string being sent is: 私のクラスへようこそ

The MBCS data received from Windows (JP) is (in bytes, with extra "0x00" bytes added to ensure null termination):

char kanji_win_raw_bytes[] =  { 0x8E, 0x84, 0x82, 0xCC, 0x83, 0x4E, 0x83, 0x89, 0x83, 0x58, 0x82, 0xD6, 0x82, 0xE6, 0x82, 0xA4, 0x82, 0xB1, 0x82, 0xBB, 0x00, 0x00, 0x00 };

As nearly as I can tell, the string is coming from a Windows machine using the SHIFT-JIS code page. I've tried mbsrtowcs():

const char *ptr = (char*)m_data;
// m_data contains the byte array of MBCS data
if ( m_data != NULL )
{
    std::mbstate_t state = std::mbstate_t();

    size_t bufflen = std::mbsrtowcs(NULL, &ptr, 0, &state);
    if ( bufflen == (size_t)-1 )
    {
        std::cout << "ERROR! mbsrtowcs() " << strerror(errno) << std::endl;
        std::cout << "Error at: " <<  (int32_t)( (char*)ptr - (char*)m_data ) << std::endl;
        return;
    }

    std::vector<wchar_t> wstr(bufflen);
    std::cout << "converting " << bufflen << " characters" << std::endl;
    std::mbsrtowcs(&wstr[0], &ptr, wstr.size(), &state);
    std::wcout << "Wide string: " << &wstr[0] << std::endl
        << "The length, including '\\0': " << wstr.size() << std::endl;
}

The call to mbsrtowcs() fails at position "0" with no characters converted.

I then tried the iconv library using the SHIFT-JIS code page:

bytes_converted = 0;
char input[4096] = {0};
char dst[4096] = {0};
char* src = input;
size_t dstlen = sizeof(dst);
size_t srclen = 0;
iconv_t conv = iconv_open("UTF-8", "SHIFT-JIS" );

// make a copy
memcpy( (void*)input, (void*)kanji_win_raw_bytes, sizeof(kanji_win_raw_bytes) );
srclen = sizeof(kanji_win_raw_bytes);

if ( conv != (iconv_t)-1 )
{
    bytes_converted = iconv( conv, NULL, NULL, (char**)&dst, &dstlen );
    if ( bytes_converted == (size_t) -1 )
    {
        std::cerr << "ERROR: initializing output buffer: (" << errno << ") " << strerror(errno) << std::endl;
    }
    bytes_converted = iconv(conv, (char**)&src, &srclen, (char**)&dst, &dstlen);
    if ( bytes_converted == (size_t) - 1)
    {
        std::cerr << "ERROR in conversion: (" << errno << ") " << strerror(errno) << std::endl;
        if ( errno == EINVAL )
        {
                std::cerr << "RESULT: iconv() converted " << bytes_converted << " bytes: [" << dst << "]" << std::endl;
        }

    }
    else
    {
        std::cerr << "SUCCESS: iconv() converted " << bytes_converted << " bytes: [" << dst << "]" << std::endl;
    }
    iconv_close(conv);
}
else
{
    std::cerr << "ERROR: iconv_open() failed: " << strerror(errno) << std::endl;
}

iconv segfaults (core dumps) on the given Japanese string. Having only used iconv a few times, I believe the code snippets (copied from online samples) are correct; they seem to work OK with a similar setup for Latin-based languages (e.g. German or French MBCS strings coming from the Windows server).

std::wstring_convert and the codecvt facilities do not yet seem to be implemented on Linux, even when compiling with -std=c++11, so that doesn't appear to be an option.

Thanks in advance for any help you can provide.

-- Edit --

With the help of "myk", I created a sample application that better shows my problem. With his suggestions I was able to get around the segmentation fault; however, the Windows MBCS string fails to convert regardless of the locale I choose.

/**
 * MBCS test
 */

    #include <stdlib.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <string.h>
    #include <errno.h>

    #include <clocale>
    #include <string>
    #include <iostream>


    // 私のクラスへようこそ   (welcome to my class)
    const char* kanji_string = "私のクラスへようこそ";
    // This is what raw UTF-8 should look like (NUL-terminated so it can be streamed as a C string)
    uint8_t kanji_utf8_raw_bytes[] = { 0xE7, 0xA7, 0x81, 0xE3, 0x81, 0xAE, 0xE3, 0x82, 0xAF, 0xE3, 0x83, 0xA9, 0xE3, 0x82, 0xB9, 0xE3, 0x81, 0xB8, 0xE3, 0x82, 0x88, 0xE3, 0x81, 0x86, 0xE3, 0x81, 0x93, 0xE3, 0x81, 0x9D, 0x00 };

    // This is Windows MBCS using the SHIFT-JIS code page
    uint8_t kanji_win_raw_bytes[] = { 0x8E, 0x84, 0x82, 0xCC, 0x83, 0x4E, 0x83, 0x89, 0x83, 0x58, 0x82, 0xD6, 0x82, 0xE6, 0x82, 0xA4, 0x82, 0xB1, 0x82, 0xBB, 0x00, 0x00, 0x00 };

    int main( int argc, char **argv )
    {
        std::setlocale(LC_ALL, "en_US.utf8");

        std::cout << "KANJI    String  [" << kanji_string << "]" << std::endl;  
        std::cout << "KANJI UTF-8 Raw  [" << kanji_utf8_raw_bytes << "]" << std::endl;  

        const char *data = (char*)kanji_win_raw_bytes;
        std::mbstate_t state = std::mbstate_t();
        size_t result = 0;

        wchar_t* buffer = (wchar_t*)malloc( sizeof(wchar_t) * (strlen((char*)data) + 1) );

        if ( buffer )
        {
            result = std::mbsrtowcs(buffer, &data, strlen(data), &state);
            if ( result == (size_t)-1 )
            {
                std::cout << "ERROR! mbsrtowcs() " << strerror(errno) << std::endl;
                std::cout << "Error at: " <<  (int32_t)( (char*)data - (char*)kanji_win_raw_bytes ) << std::endl;
            }
            else
            {
                std::wcout << "Wide string: [" << buffer << "] " << std::endl;
            }
            free( buffer );
        }

        return 0;
    }

Note: this can be compiled and run on Linux/Mac with the following command:

g++ mbcs_test.cpp -o mbcs_test && ./mbcs_test

score 2 · Answer 1 · answered Jul 17 '14 at 13:03

For mbsrtowcs(), a couple of things:

1) The call:

size_t bufflen = std::mbsrtowcs(NULL, &ptr, 0, &state);

should be something like:

size_t bufflen = std::mbsrtowcs(buffer, &ptr, strlen(m_data), &state);

assuming that you have declared 'buffer' with something like:

wchar_t* buffer = (wchar_t*) malloc(sizeof(wchar_t) * (strlen(m_data) + 1));

The third parameter in mbsrtowcs(), which you set to zero, is the length of the result buffer, which is presumably why 0 characters are getting converted.

2) My experience is you need to have used setlocale() for mbsrtowcs() to work. I can't see from the code snippet, but suggest you include something like:

#include <clocale>

:

std::setlocale(LC_ALL, "en_US.utf8");
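
A note on point 1, as far as I read the C standard: when the destination is a null pointer, mbsrtowcs() ignores the length argument, stores nothing, and returns the number of wide characters the conversion would need, so the NULL/0 call is a legitimate way to size the buffer. In that mode the source pointer is also left untouched, which fits an error being reported at position "0". In either mode the bytes are decoded with the encoding of the current LC_CTYPE locale, so Shift-JIS input under a UTF-8 locale should be expected to fail. A minimal sketch of the two-pass idiom follows; decode_with_current_locale is a hypothetical helper name, not something from the question.

```cpp
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a NUL-terminated multibyte string using
// whatever encoding the current LC_CTYPE locale specifies.
std::wstring decode_with_current_locale(const char* mbs)
{
    // Pass 1: null destination, so the length argument is ignored and the
    // return value is the number of wide characters needed (no terminator).
    std::mbstate_t state = std::mbstate_t();
    const char* src = mbs;
    std::size_t needed = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1))
        throw std::runtime_error("bytes are not valid in the current locale's encoding");

    // Pass 2: convert into a buffer with room for the terminating L'\0'.
    std::vector<wchar_t> buffer(needed + 1);
    state = std::mbstate_t();
    src = mbs;
    std::mbsrtowcs(buffer.data(), &src, buffer.size(), &state);
    return std::wstring(buffer.data(), needed);
}
```

Decoding the Windows bytes this way would require a locale whose character set really is Shift-JIS to be installed and selected with setlocale(); whether such a locale exists on a given Linux system is something to check rather than assume.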

Thanks for the suggestions. You are correct, I neglected to include setlocale() call in my sample above. I've updated the sample with your suggestions and it did solve the crash issue. There still exists the problem that mbsrtowcs still does not recognize the windows string as a valid MBCS string. It always returns with the "Invalid or incomplete multibyte or wide character" at position "0". — Zoccadoum, Oct 09 '14 at 19:33
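
Since iconv converts between named encodings without consulting the process locale, it is probably the more direct route here. One likely cause of the crash in the question's iconv attempt is the cast (char**)&dst: dst is an array, so &dst is the address of the buffer itself, and iconv ends up reading the buffer's first bytes as if they were a pointer. The sketch below passes the address of a real char* instead. It assumes glibc's iconv on Linux and that the guess of Shift-JIS is right; convert_to_utf8 is a hypothetical helper name.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `input` from `from_encoding` to UTF-8.
std::string convert_to_utf8(const std::string& input, const char* from_encoding)
{
    iconv_t conv = iconv_open("UTF-8", from_encoding);
    if (conv == (iconv_t)-1)
        throw std::runtime_error(std::string("iconv_open: ") + std::strerror(errno));

    // Generous output size: a Shift-JIS character (1 or 2 bytes) needs at
    // most 3 bytes in UTF-8, so 4x the input length leaves headroom.
    std::string output(input.size() * 4 + 4, '\0');

    // iconv advances these pointers, so they must be real pointer variables,
    // not the address of an array cast to char**.
    char* src = const_cast<char*>(input.data());
    size_t srclen = input.size();
    char* dst = &output[0];
    size_t dstlen = output.size();

    size_t rc = iconv(conv, &src, &srclen, &dst, &dstlen);
    int saved_errno = errno;   // iconv_close may overwrite errno
    iconv_close(conv);

    if (rc == (size_t)-1)
        throw std::runtime_error(std::string("iconv: ") + std::strerror(saved_errno));

    output.resize(output.size() - dstlen);   // keep only the bytes written
    return output;
}
```

With the question's data this could be called as convert_to_utf8((const char*)kanji_win_raw_bytes, "SHIFT_JIS"), which stops at the first NUL byte when building the std::string. The Japanese ANSI code page on Windows is CP932, a Microsoft superset of Shift-JIS, so if the local iconv accepts the name "CP932" it may be the closer match for characters outside the common set.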
