I seem to be having an issue converting a byte array (containing the text from a Word document) to an LPTSTR (wchar_t *) object. Every time the code executes, I get a bunch of unwanted Unicode characters back.

I figure it is because I am not making the proper calls somewhere, or not using the variables properly, but I'm not quite sure how to approach this. Hopefully someone here can guide me in the right direction.

The first thing that happens is that we call into C# code to open Microsoft Word and convert the text in the document into a byte array.

byte document __gc[];
document = word->ConvertToArray(filename);

The contents of document are as follows:

{84, 101, 115, 116, 32, 68, 111, 99, 117, 109, 101, 110, 116, 13, 10}

Which ends up being the following string: "Test Document".

Our next step is to allocate the memory to store the byte array in an LPTSTR variable:

byte __pin * value;

value = &document[0];

LPTSTR image;
image = (LPTSTR)malloc( document->Length + 1 );

Once we execute the line where we start allocating the memory, our image variable gets filled with a bunch of unwanted Unicode characters:

췍췍췍췍췍췍췍췍﷽﷽����˿於潁

Then we do a memcpy to transfer all of the data:

memcpy(image,value,document->Length);

Which just causes more unwanted Unicode characters to appear:

敔瑳䐠捯浵湥൴촊﷽﷽����˿於潁

I figure the issue that we are having is either related to how we are storing the values in the byte array, or possibly when we are copying the data from the byte array to the LPTSTR variable. Any help with explaining what I'm doing wrong, or anything to point me in the right direction will be greatly appreciated.

1 Answer

First, you should learn something about text data and how it's represented. A reference that will get you started is Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

byte is just a typedef or something for char or unsigned char. So the byte array is using some char encoding for the string. You need to actually convert from that encoding, whatever it is, into UTF-16 for Windows' wchar_t. Here's the typical method recommended for doing such conversions on Windows:

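// value must point to null-terminated text: -1 tells the call to read up to and including the terminator
const char *source = reinterpret_cast<const char *>(value);
int output_size = MultiByteToWideChar(CP_ACP, 0, source, -1, NULL, 0); // first call: measure only
assert(0 < output_size);
wchar_t *converted_buf = new wchar_t[output_size]; // release with delete[] when done
int size = MultiByteToWideChar(CP_ACP, 0, source, -1, converted_buf, output_size); // second call: convert
assert(output_size == size);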

We call the function MultiByteToWideChar() twice, once to figure out how large of a buffer is needed to hold the result of the conversion, and a second time, passing in the buffer we allocated, to do the actual conversion.

CP_ACP specifies the source encoding, and you'll need to check the documentation of the API that produced the byte array to figure out what that value really should be. CP_ACP stands for 'codepage: ANSI codepage', which is Microsoft's way of saying 'the encoding set for "non-Unicode" programs.' The API may be using something else, like CP_UTF8 (we can hope) or 1252 or something.

See the rest of the MultiByteToWideChar documentation to figure out the other arguments.
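The measure-then-convert pattern is not specific to Win32. As a rough, portable sketch of the same idea, the standard library's std::mbsrtowcs can also be called twice: once with a null destination to get the required length, and once to fill the buffer. This is an analogue, not the Win32 call: it uses the current C locale instead of a code page argument, and it needs null-terminated input. The function name widen is hypothetical.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a null-terminated multibyte string to a wide
// string using the current C locale (plain ASCII works in the default locale).
std::wstring widen(const std::string &narrow) {
    // First call: a null destination means "measure only". The return value
    // is the number of wide characters needed, excluding the terminator.
    std::mbstate_t state = std::mbstate_t();
    const char *src = narrow.c_str();
    std::size_t needed = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence");
    }

    // Second call: convert into a buffer of exactly the measured size.
    std::wstring result(needed, L'\0');
    state = std::mbstate_t();
    src = narrow.c_str();
    std::size_t written = std::mbsrtowcs(&result[0], &src, needed, &state);
    if (written != needed) {
        throw std::runtime_error("conversion did not match measured size");
    }
    return result;
}
```

Holding the result in a std::wstring avoids the manual new[]/delete[] pairing; result.c_str() gives a null-terminated const wchar_t * when one is needed.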


> Once we execute the line where we start allocating the memory, our image variable gets filled with a bunch of unwanted Unicode characters:

When you call malloc() the memory given to you is uninitialized and just contains garbage. The values you see before initializing it don't matter and you simply shouldn't use that data. The only data that matters is what you fill the buffer with. Because we pass -1 as the input length, the MultiByteToWideChar() code above will also automatically null-terminate the string, so you won't see garbage in unused buffer space (and the method we use of allocating the buffer will not leave any extra space).
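As for the characters that appear after the memcpy in the question, they are consistent with the debugger reading each pair of copied bytes as one 16-bit little-endian code unit. Here is a small sketch of that arithmetic, assuming a little-endian machine such as x86 Windows; the helper names as_utf16_units and widen_ascii are hypothetical and only illustrate the difference between reinterpreting bytes and converting them.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pairs consecutive bytes into 16-bit little-endian code units. This mimics
// what happens when a byte buffer is memcpy'd into wchar_t storage and then
// read back as UTF-16 (assumption: little-endian layout).
std::vector<std::uint16_t> as_utf16_units(const std::vector<std::uint8_t> &bytes) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return units;
}

// Gives each byte its own code unit. This is only a correct conversion for
// plain ASCII input; other encodings need a real conversion routine.
std::vector<std::uint16_t> widen_ascii(const std::vector<std::uint8_t> &bytes) {
    return std::vector<std::uint16_t>(bytes.begin(), bytes.end());
}
```

For the first two bytes of the document, 84 ('T', 0x54) and 101 ('e', 0x65), the pairing yields the single unit 0x6554, which is the first character shown in the question's post-memcpy output. The question's buffer is also too small for wide text: it allocates document->Length + 1 bytes, whereas wide characters need sizeof(wchar_t) bytes each.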


The above code is not actually very good C++ style. It's just typical usage of the C-style API provided by Win32. The way I prefer to do conversions (if I'm forced to) is more like:

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert; // converter object saved somewhere

std::wstring output = convert.from_bytes(value);

(Assuming the char encoding being used is UTF-8. You'll have to use a different codecvt facet for any other encoding. Note that std::wstring_convert and std::codecvt_utf8_utf16 were deprecated in C++17.)
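Since the byte array in the question carries a length rather than a null terminator, the range overload of from_bytes may be a better fit. A minimal sketch, assuming the bytes really are UTF-8; the function name utf8_to_wide and its parameters are hypothetical.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: converts `length` bytes of UTF-8 text to UTF-16 code
// units stored in a std::wstring. No null terminator is required on input.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::wstring utf8_to_wide(const unsigned char *data, std::size_t length) {
    // from_bytes works on char, so view the byte buffer as const char *.
    const char *first = reinterpret_cast<const char *>(data);
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
    return convert.from_bytes(first, first + length);
}
```

With the pinned pointer from the question, a call might look like utf8_to_wide(value, document->Length), and the resulting string's c_str() can be passed wherever a const wchar_t * is expected.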

Cristian Ciupitu
bames53
  • Out of interest, are there any Windows code pages for which a single byte / code point can require more than one UTF-16 code unit? I agree that calling MBTWC twice is the right thing to do, I'm just vaguely curious whether the result is all that unpredictable :-) – Steve Jessop Dec 14 '12 at 23:38
  • @SteveJessop Yes, CP_UTF8. Another question is whether any locale's codepage on Windows supports a character that requires surrogate code points; I don't know the answer to that, but if one does then it's a violation of the standard (C++11 § 3.9.1/5). – bames53 Dec 14 '12 at 23:45
  • Oops, I should have specified a code page *that could be CP_ACP*. What I'm getting at is, can the measurement actually ever return more than the length of the string if you pass in CP_ACP, or in that case is `MultiByteToWideChar` equivalent to `strlen(value)+1`? It sounds like the answer is the latter, supposing MS has conformed to the standard. – Steve Jessop Dec 14 '12 at 23:49
  • @SteveJessop `CP_ACP` corresponds to the locale encoding (`setlocale("")`) and so as I said before I'm unsure of the answer. But I suspect that the answer is 'no, CP_ACP does not support anything outside the BMP.' And I suspect the fact that characters outside the BMP can't be supported in any locale encoding/'ansi' codepage without either violating the standard (and breaking code that relies on §3.9.1/5) or switching away from UTF-16 for wchar_t, is the reason why UTF-8 will never be supported as a locale encoding/'ansi' codepage. – bames53 Dec 14 '12 at 23:54
  • @SteveJessop however that's not to say using `MultiByteToWideChar` to calculate the buffer length is the same as `strlen()+1`. Using `strlen()` will sometimes calculate a buffer size that is larger than necessary, because it will count every byte of a multibyte sequence as requiring its own wchar_t in the result, when in fact multiple bytes may correspond to a single `wchar_t`. – bames53 Dec 14 '12 at 23:59
  • High powered discussion for a question asking why malloc fills his array with funny characters! Maybe change the question to fit the answer so it's more discoverable? – Nicholas Wilson Dec 15 '12 at 00:02
  • @NicholasWilson Or ask a new question! – bames53 Dec 15 '12 at 00:08
  • @NicholasWilson: I completely forgot that vitally important part of the question. bames53, you didn't explain about uninitialized memory, and it's too late for me to take my +1 off due to this lack! – Steve Jessop Dec 15 '12 at 00:08
  • Thanks for all the help everyone. I've already learned so much (and it's just the tip of the iceberg, so to speak) – Christopher MacKinnon Dec 17 '12 at 15:20
  • The name MultiByteToWideChar implies that the source data is 2 bytes per char in the first place and that the output is the same format, but with each byte not having the split between bytes on the 8th bit. Basically, Microsoft's string processing library is so bad I could cry. Under gcc you get none of these problems, and you can fall back on C processing that actually works. But no, Microsoft had to go and screw up their entire string processing libraries. – Owl May 12 '17 at 13:37