I'm writing a program on Linux that has to interface with an existing Windows program. I cannot modify the way the Windows program works, but I must integrate with its existing data. My program will receive raw data structures over a TCP socket. Unfortunately, the Windows program embeds raw multibyte character strings in the data structures and does not indicate which code page is in use. This works OK for English, but fails miserably with non-Latin-based languages (e.g. Japanese). At best, I can guess at the code page Windows is using. If my locale is set to "ja" or "ja_JP", I'll have to assume the Windows machine is using the Shift-JIS code page... Ugly, but that's life.
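To make the guessing step concrete, here is a minimal sketch of a locale-to-code-page lookup. The function name `guess_windows_codepage` is hypothetical, the table is deliberately incomplete, and the returned strings are the code page names as iconv implementations commonly spell them (an assumption worth checking with `iconv -l` on the target system).

```cpp
#include <string>

// Hypothetical helper: map a POSIX locale name such as "ja_JP.UTF-8" to the
// name of the Windows ANSI code page usually paired with that language.
// The table is illustrative only and far from complete.
std::string guess_windows_codepage(const std::string& locale_name)
{
    // Strip the codeset and modifier: "ja_JP.UTF-8" -> "ja_JP".
    const std::string base = locale_name.substr(0, locale_name.find_first_of(".@"));
    // Strip the territory: "ja_JP" -> "ja".
    const std::string lang = base.substr(0, base.find('_'));

    if (lang == "ja") return "CP932";   // Japanese (Microsoft's Shift-JIS variant)
    if (lang == "ko") return "CP949";   // Korean
    if (lang == "zh") {
        // Traditional Chinese regions use 950, Simplified Chinese uses 936.
        if (base == "zh_TW" || base == "zh_HK") return "CP950";
        return "CP936";
    }
    if (lang == "ru") return "CP1251";  // Cyrillic
    return "CP1252";                    // Western European fallback
}
```

The input would typically come from `setlocale(LC_CTYPE, NULL)` or the `LANG` environment variable. The result is still only a guess about the sender's configuration.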
QUESTION:
Assuming I've guessed correctly at the codepage, how can I convert these raw MBCS character strings to UTF-8 strings?
Here is a sample of the raw data:
The string being sent is: 私のクラスへようこそ
The MBCS data received from Windows (JP) is (in bytes, with extra 0x00 bytes added to ensure null termination):
unsigned char kanji_win_raw_bytes[] = { 0x8E, 0x84, 0x82, 0xCC, 0x83, 0x4E, 0x83, 0x89, 0x83, 0x58, 0x82, 0xD6, 0x82, 0xE6, 0x82, 0xA4, 0x82, 0xB1, 0x82, 0xBB, 0x00, 0x00, 0x00 };
As nearly as I can tell, the string is coming from a Windows machine using the Shift-JIS code page. I've tried mbsrtowcs():
const char *ptr = (char*)m_data;
// m_data contains the byte array of MBCS data
if ( m_data != NULL )
{
    std::mbstate_t state = std::mbstate_t();
    // First pass: count the wide characters (the terminator is not included)
    size_t bufflen = std::mbsrtowcs(NULL, &ptr, 0, &state);
    if ( bufflen == (size_t)-1 )
    {
        std::cout << "ERROR! mbsrtowcs() " << strerror(errno) << std::endl;
        std::cout << "Error at: " << (int32_t)( (char*)ptr - (char*)m_data ) << std::endl;
        return;
    }
    // +1 leaves room for the terminating L'\0'
    std::vector<wchar_t> wstr(bufflen + 1);
    std::cout << "converting " << bufflen << " characters" << std::endl;
    std::mbsrtowcs(&wstr[0], &ptr, wstr.size(), &state);
    std::wcout << "Wide string: " << &wstr[0] << std::endl
               << "The length, including '\\0': " << wstr.size() << std::endl;
}
The call to mbsrtowcs() fails at position "0" with no characters converted.
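For context, mbsrtowcs() decodes bytes according to the LC_CTYPE category of the current C locale, not according to any encoding attached to the data. Under a UTF-8 locale the byte 0x8E is a continuation byte and cannot start a character, and the default "C" locale is typically ASCII-only, which would explain a failure at offset 0. The sketch below switches LC_CTYPE temporarily. The helper name `decode_with_locale` is hypothetical, and a Shift-JIS locale such as "ja_JP.SJIS" is an assumption: many Linux systems do not ship one, in which case the function simply returns false.

```cpp
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

// Decode a NUL-terminated multibyte string using the encoding of the named
// locale. Returns false if the locale cannot be selected or the bytes are
// not valid in that locale's encoding.
bool decode_with_locale(const char* bytes, const char* locale_name, std::wstring& out)
{
    // Remember the current LC_CTYPE setting so it can be restored afterwards.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr)
        return false;  // locale not available on this system

    bool ok = false;
    std::mbstate_t state = std::mbstate_t();
    const char* src = bytes;
    // First pass: count the wide characters (terminator not included).
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len != static_cast<std::size_t>(-1)) {
        std::vector<wchar_t> buf(len + 1);
        src = bytes;
        state = std::mbstate_t();
        // Second pass: convert, with room for the terminating L'\0'.
        std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
        out.assign(buf.data(), len);
        ok = true;
    }
    std::setlocale(LC_CTYPE, saved.c_str());
    return ok;
}
```

A call such as `decode_with_locale(data, "ja_JP.SJIS", w)` only succeeds where that locale exists. Because setlocale() changes process-wide state, this approach is awkward in multithreaded programs.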
I then tried the iconv library using the Shift-JIS code page:
size_t bytes_converted = 0;
char input[4096] = {0};
char dst[4096] = {0};
char* src = input;
size_t dstlen = sizeof(dst);
size_t srclen = 0;
iconv_t conv = iconv_open("UTF-8", "SHIFT-JIS" );
// make a copy
memcpy( (void*)input, (void*)kanji_win_raw_bytes, sizeof(kanji_win_raw_bytes) );
srclen = sizeof(kanji_win_raw_bytes);
if ( conv != (iconv_t)-1 )
{
    bytes_converted = iconv( conv, NULL, NULL, (char**)&dst, &dstlen );
    if ( bytes_converted == (size_t) -1 )
    {
        std::cerr << "ERROR: initializing output buffer: (" << errno << ") " << strerror(errno) << std::endl;
    }
    bytes_converted = iconv(conv, (char**)&src, &srclen, (char**)&dst, &dstlen);
    if ( bytes_converted == (size_t) - 1)
    {
        std::cerr << "ERROR in conversion: (" << errno << ") " << strerror(errno) << std::endl;
        if ( errno == EINVAL )
        {
            std::cerr << "RESULT: iconv() converted " << bytes_converted << " bytes: [" << dst << "]" << std::endl;
        }
    }
    else
    {
        std::cerr << "SUCCESS: iconv() converted " << bytes_converted << " bytes: [" << dst << "]" << std::endl;
    }
    iconv_close(conv);
}
else
{
    std::cerr << "ERROR: iconv_open() failed: " << strerror(errno) << std::endl;
}
iconv segfaults (dumps core) on the given Japanese string. Having only used iconv a few times, I believe the code snippets (copied from online samples) are correct; a similar setup seems to work OK with Latin-based languages, using different (e.g. German / French) MBCS strings coming from the Windows server.
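One likely cause of the crash is the expression `(char**)&dst`: `dst` is an array, so `&dst` has type `char (*)[4096]`, and after the cast iconv treats the first bytes of the buffer (all zero here) as the output pointer. iconv needs the address of a real `char*` variable that it can advance. The following is a minimal sketch under that assumption; the function name `to_utf8` is hypothetical, and the charset name is passed in because spellings such as "SHIFT_JIS" or "CP932" depend on the iconv implementation.

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Convert `input` from `from_charset` to UTF-8 with iconv.
// Throws std::runtime_error if the charset is unknown or the input is invalid.
std::string to_utf8(const std::string& input, const char* from_charset)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out;
    char buf[256];
    // iconv advances these pointers, so they must be real char* variables,
    // not the address of an array.
    char* src = const_cast<char*>(input.data());
    size_t srcleft = input.size();

    while (srcleft > 0) {
        char* dst = buf;
        size_t dstleft = sizeof(buf);
        const size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
        const int err = errno;  // capture before anything else can change it
        out.append(buf, sizeof(buf) - dstleft);
        // E2BIG only means the chunk buffer is full: loop and continue.
        if (rc == (size_t)-1 && err != E2BIG) {
            // EILSEQ: invalid sequence, EINVAL: truncated sequence at the end.
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or truncated input");
        }
    }
    iconv_close(cd);
    return out;
}
```

With the sample bytes this would be called as `to_utf8(std::string(reinterpret_cast<const char*>(kanji_win_raw_bytes), 20), "SHIFT_JIS")`, passing the 20 payload bytes rather than `sizeof` the array so that the padding NULs are not converted too.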
The codecvt facilities (std::wstring_convert) do not yet seem to be implemented on Linux, even when compiling with -std=c++11, so that doesn't appear to be an option.
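If a wide string can be obtained some other way, the UTF-8 encoding step itself does not need std::wstring_convert. The sketch below encodes by hand; the name `wide_to_utf8` is hypothetical, and it assumes each wchar_t holds one Unicode code point, which holds for the 32-bit wchar_t used by glibc on Linux but not for 16-bit wchar_t on Windows.

```cpp
#include <string>

// Encode a sequence of Unicode code points as UTF-8.
// Assumes one code point per wchar_t (32-bit wchar_t, as on Linux).
// Surrogates and out-of-range values are replaced with U+FFFD.
std::string wide_to_utf8(const std::wstring& in)
{
    std::string out;
    for (wchar_t wc : in) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;  // replacement character
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For the first character of the sample string, U+79C1, this yields the bytes E7 A7 81, matching the start of the expected UTF-8 data.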
Thanks in advance for any help you can provide.
-- Edit --
With the help of "myk", I created a sample application that better shows my problem. With his suggestions, I was able to get around the segmentation fault; however, the Windows MBCS string still fails to convert regardless of the locale I choose.
/**
 * MBCS test
 */
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <string.h>
#include <errno.h>
#include <clocale>
#include <cwchar>
#include <string>
#include <iostream>
// 私のクラスへようこそ (welcome to my class)
const char* kanji_string = "私のクラスへようこそ";
// This is what raw UTF-8 should look like (NUL-terminated so it can be printed)
uint8_t kanji_utf8_raw_bytes[] = { 0xE7, 0xA7, 0x81, 0xE3, 0x81, 0xAE, 0xE3, 0x82, 0xAF, 0xE3, 0x83, 0xA9, 0xE3, 0x82, 0xB9, 0xE3, 0x81, 0xB8, 0xE3, 0x82, 0x88, 0xE3, 0x81, 0x86, 0xE3, 0x81, 0x93, 0xE3, 0x81, 0x9D, 0x00 };
// This is Windows MBCS using the Shift-JIS code page
uint8_t kanji_win_raw_bytes[] = { 0x8E, 0x84, 0x82, 0xCC, 0x83, 0x4E, 0x83, 0x89, 0x83, 0x58, 0x82, 0xD6, 0x82, 0xE6, 0x82, 0xA4, 0x82, 0xB1, 0x82, 0xBB, 0x00, 0x00, 0x00 };
int main( int argc, char **argv )
{
    std::setlocale(LC_ALL, "en_US.utf8");
    std::cout << "KANJI String [" << kanji_string << "]" << std::endl;
    std::cout << "KANJI UTF-8 Raw [" << kanji_utf8_raw_bytes << "]" << std::endl;
    const char *data = (char*)kanji_win_raw_bytes;
    std::mbstate_t state = std::mbstate_t();
    size_t result = 0;
    wchar_t* buffer = (wchar_t*)malloc( sizeof(wchar_t) * (strlen((char*)data) + 1) );
    if ( buffer )
    {
        result = std::mbsrtowcs(buffer, &data, strlen(data), &state);
        if ( result == (size_t)-1 )
        {
            std::cout << "ERROR! mbsrtowcs() " << strerror(errno) << std::endl;
            std::cout << "Error at: " << (int32_t)( (char*)data - (char*)kanji_win_raw_bytes ) << std::endl;
        }
        else
        {
            std::wcout << "Wide string: [" << buffer << "] " << std::endl;
        }
        free( buffer );
    }
    return 0;
}
Note: this can be compiled and run on Linux/Mac with the following command:
g++ mbcs_test.cpp -o mbcs_test && ./mbcs_test