Here is the first part of a Word .docx file, in hex:
50 4b 03 04 14 00 06 00 08 00 00 00 21 00 e1 0f
8e bf 8d 01 00 00 29 06 00 00 13 00 08 02 5b 43
6f 6e 74 65 6e 74 5f 54 79 70 65 73 5d 2e 78 6d
6c 20 a2 04 02 28 a0 00 02 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Note that each 2-digit value above is one byte. The first two values -- 50 and 4b -- are the ASCII codes for the characters P and K. (Google "ASCII table" and you will see what I mean.)
Here is all the readable text you can pick out of that data:
PK[Content_Types].xml
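To see this for yourself, here is a small Python sketch (Python is just my choice for illustration) that turns the first four rows of the dump back into bytes and prints a dot for every byte that is not printable ASCII. The names HEX_DUMP and printable_view are made up for this example.

```python
# First four rows of the hex dump above; the rows after these are all zeros.
HEX_DUMP = """
50 4b 03 04 14 00 06 00 08 00 00 00 21 00 e1 0f
8e bf 8d 01 00 00 29 06 00 00 13 00 08 02 5b 43
6f 6e 74 65 6e 74 5f 54 79 70 65 73 5d 2e 78 6d
6c 20 a2 04 02 28 a0 00 02 00 00 00 00 00 00 00
"""

# Strip all whitespace, then convert each pair of hex digits to one byte.
data = bytes.fromhex("".join(HEX_DUMP.split()))


def printable_view(raw: bytes) -> str:
    """Show printable ASCII as-is and everything else as a dot."""
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in raw)


# Bytes above 0x7F cannot be ASCII at all.
high_bytes = [b for b in data if b > 0x7F]

print(printable_view(data))
print("bytes above 0x7F:", [hex(b) for b in high_bytes])
```

The "PK" at the start and the "[Content_Types].xml" name stand out; almost everything else is dots.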
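If you look at the hex values, anything above 0x7F is not an ASCII character, and in UTF8 such a byte is valid only as part of a well-formed multi-byte sequence, which arbitrary binary data rarely forms. Many of the values below 0x20 (such as 00, 03 and 04) are control codes rather than printable characters. When such data is transmitted over the internet via certain text-oriented protocols, it is apt to get garbled (since those protocols expect ASCII text) unless it is first encoded into ASCII. This is the purpose of "Base-64".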
Base-64 encodes the above data as:
UEsDBBQABgAIAAAAIQDhD46/jQEAACkGAAATAAgCW0NvbnRlbn
RfVHlwZXNdLnhtbCCiBAIooAACAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
This can be safely transmitted, since Base-64 output consists only of letters, digits, "+", "/" and (for padding) "=", all of which are regular ASCII characters (their numeric values are below 0x7f).
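You can reproduce that encoding with a few lines of Python. This is only a sketch: the variable names are mine, and the 50-character line width is used only to match the layout shown above.

```python
import base64

# The 64 non-trivial bytes of the dump, followed by the 80 zero bytes
# that make up its last five rows (144 bytes in total).
raw = bytes.fromhex(
    "504b0304140006000800000021 00e10f"
    "8ebf8d01000029060000130008025b43"
    "6f6e74656e745f54797065735d2e786d"
    "6c20a2040228a0000200000000000000".replace(" ", "")
) + bytes(80)

# b64encode returns bytes; every one of them is an ASCII character.
encoded = base64.b64encode(raw).decode("ascii")

# Wrap at 50 characters per line, as in the listing above.
lines = [encoded[i:i + 50] for i in range(0, len(encoded), 50)]
print("\n".join(lines))

# Every character of the encoded form is plain ASCII.
all_ascii = all(ord(ch) < 0x7F for ch in encoded)
print("all ASCII:", all_ascii)
```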
When you decode the Base-64 you get back exactly the same bytes that you started with (provided nothing altered the text in transit), so if you write that data to a file you will have "reconstituted" the original .docx file.
If, on the other hand, you feed the decoded data (or data never encoded) into a byte-to-string converter (such as newStringUtf8), then the bytes above 0x7f are interpreted as parts of UTF8 sequences and translated into the corresponding UTF16 or UTF32 characters. But "binary" data (such as the header data in a .doc or .docx file) is just numbers -- it's not character data. Converting those binary values to UTF characters produces nothing meaningful. Further, byte sequences that are not valid UTF8 are typically replaced with a substitute character (U+FFFD) or rejected outright, so the original values do not survive the conversion and will not convert back correctly.
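Here is a rough Python illustration of that loss. It is not the Java newStringUtf8 call itself: Python's errors="replace" mode is used as a stand-in, on the assumption that the converter substitutes U+FFFD for malformed input rather than raising an error.

```python
# The first 20 bytes of the dump are enough to show the problem.
raw = bytes.fromhex("504b0304140006000800000021 00e10f8ebf8d01".replace(" ", ""))

# Strict decoding refuses the data: 0xe1 must be followed by continuation bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print("strict decode failed:", exc)

# Lenient decoding "succeeds" but swaps the bad bytes for U+FFFD.
text = raw.decode("utf-8", errors="replace")
print("replacement characters:", text.count("\ufffd"))

# Going back to bytes does not restore the original values.
round_tripped = text.encode("utf-8")
print("survived round trip:", round_tripped == raw)
```

Once a byte has been replaced, nothing in the string records what it originally was, which is why the conversion cannot be undone.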
The way to deal with this file is to "reconstitute" the .docx file from its Base-64 form into "binary", write that data out as a "binary" file, and then use software that understands how to read its header and take it apart sensibly. This would be either Word itself or some API written specifically to access the innards of Word files.
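A minimal sketch of the decode-and-write step in Python follows. The file name fragment.docx and the temporary directory are placeholders of mine, and because the Base-64 text above covers only the first 144 bytes, the result is just the start of the file, not something Word could open. The point is that the data stays as bytes the whole way and never passes through a string conversion.

```python
import base64
import tempfile
from pathlib import Path

# The Base-64 text from above, with the trailing run of 116 "A"s abbreviated.
ENCODED = (
    "UEsDBBQABgAIAAAAIQDhD46/jQEAACkGAAATAAgCW0NvbnRlbnRf"
    "VHlwZXNdLnhtbCCiBAIooAAC" + "A" * 116
)

# Decode straight to bytes.
decoded = base64.b64decode(ENCODED)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "fragment.docx"  # hypothetical file name
    out_path.write_bytes(decoded)           # binary write, no text encoding
    restored = out_path.read_bytes()        # binary read

# The first four bytes are 50 4b 03 04, i.e. "PK" plus two control bytes.
print("signature:", decoded[:4])
print("file matches decoded data:", restored == decoded)
```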