Decode Base 64 and encode as Utf-8 still leaves encoded characters

Question

I have a program that has to decode a base 64 string and then decode it again as UTF-8. The program is pulling the text from a .doc and then downloading it locally from Dropbox (using Temboo). There are still weird characters before and after the document. This is what a section of the page looks like in Microsoft Word 2011 for Mac:

Image of characters

I tried to put the text into text decoders online and couldn't seem to find what encoding the chunk of text above was. This is how I am currently decoding the text:

encoded = encoded.replaceAll("\r\n", "");
encoded = encoded.replaceAll("\n", "");
encoded = encoded.replaceAll("\r", "");

// decoding the response
decoded = StringUtils.newStringUtf8(Base64.decodeBase64(encoded));

In TextEdit.app it looks like this:

TextEdit

Does anyone know what encoding this is and how I can decode these characters?

It seems to be a binary header. Can this be a Microsoft Word document? — Jongware, Jul 31 '14 at 23:23
"Decode base 64 and encode as UTF-8" makes no sense. Base 64 is for encoding arbitrary binary data, and you cannot encode arbitrary binary data into UTF-8. — Hot Licks, Jul 31 '14 at 23:47
This question appears to be off-topic because it is about doing stuff that doesn't make sense. — Hot Licks, Jul 31 '14 at 23:48
My explanation might not have made sense, but the pictures and code I gave do. So please reverse your down vote, because the reason you gave for down voting doesn't make sense @HotLicks – heinst, Jul 31 '14 at 23:50
And that's the point of stackoverflow right? You try things and what you try didn't work so you run out of options and ideas and come here for more ideas and guidance. — heinst, Jul 31 '14 at 23:53
So please @HotLicks stop taking your angst or whatever is going wrong in your life out on me and go somewhere else and do that — heinst, Jul 31 '14 at 23:54
Actually, no. The purpose of SO is to create an archive of useful information. Garbling Base64 isn't useful. — Hot Licks, Jul 31 '14 at 23:56
And read my other comment below. It's probably the best advice you will get here. — Hot Licks, Jul 31 '14 at 23:56

score 3 · Answer 1 · answered Jul 31 '14 at 23:10

There is no Base64 in your samples. I would recommend using an Office format library (such as Apache POI) to extract text/data from Office documents.

eckes

The string that I am decoding is base64. I get the contents of a file through an API call and it is given to me in base64. So I am trying to decode that string and then decode it using stringAsUtf8. Then I write a .doc file using FileOutputStream. But I still get those random omega characters – heinst Jul 31 '14 at 23:20
Well, I don't understand how Base64, DOC and UTF-8 are related (normally they are not), so I suspect that's your problem. If this is a Base64-decoded MS Word document, then write the bytes 1:1 without conversion to UTF-8. And then you need to open it in a program which can actually read .doc files (not a text editor). – eckes Jul 31 '14 at 23:42
Decode the Base64 to binary (byte[]) and write *that* to a file. Open with Word. – Hot Licks Jul 31 '14 at 23:54

score 2 · Accepted Answer · answered Aug 01 '14 at 00:52

Here is the first part of a Word .docx file, in hex:

50 4b 03 04 14 00 06 00 08 00 00 00 21 00 e1 0f
8e bf 8d 01 00 00 29 06 00 00 13 00 08 02 5b 43
6f 6e 74 65 6e 74 5f 54 79 70 65 73 5d 2e 78 6d
6c 20 a2 04 02 28 a0 00 02 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Note that each 2-digit value above is one byte. The first two values -- 50 and 4b -- are the ASCII characters P and K. (Google "ASCII table" and you will see what I mean.)

Here is all the character data you can see:

PK[Content_Types].xml

If you look at the hex values, any byte with a value above 0x7F is not an ASCII character, and on its own it is not valid UTF-8 either (in UTF-8 such bytes only occur as part of multi-byte sequences). When binary data like this is transmitted over the internet via protocols that expect ASCII text, it is apt to get garbled unless it is first encoded into ASCII. This is the purpose of "Base-64".

Base-64 encodes the above data as:

UEsDBBQABgAIAAAAIQDhD46/jQEAACkGAAATAAgCW0NvbnRlbn
RfVHlwZXNdLnhtbCCiBAIooAACAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This can be safely transmitted, since all values are regular ASCII characters (their numeric values are below 0x7f).
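As a quick check of the encoding step, here is a minimal sketch that encodes the first six bytes of the hex dump. It assumes Java 8 or later and uses `java.util.Base64` instead of the Apache Commons Codec class from the question; the class and method names are made up for illustration.

```java
import java.util.Base64;

class EncodeHeader {
    // Encode raw bytes as standard Base64 text (RFC 4648 alphabet).
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // First six bytes of the hex dump: 50 4b 03 04 14 00
        byte[] header = {0x50, 0x4b, 0x03, 0x04, 0x14, 0x00};
        System.out.println(toBase64(header)); // prints UEsDBBQA
    }
}
```

The result matches the first eight characters of the Base-64 text shown above, since every three input bytes become four output characters.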

When you decode the Base-64 you presumably get back the same data that you started with, so if you write that data to a file you will have "reconstituted" the original .docx file.

If, on the other hand, you feed the decoded data (or data that was never encoded) into a byte-to-string converter such as newStringUtf8, then the bytes larger than 0x7F are interpreted as parts of UTF-8 sequences and translated into the corresponding UTF-16 characters (which is what a Java String holds). But "binary" data (such as the header data in a .doc or .docx file) is just numbers -- it's not character data. Converting those binary values to characters produces nothing meaningful. Further, byte sequences that are not valid UTF-8 do not survive the conversion and will not convert back correctly.
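The loss can be seen with a small round trip. This is only a sketch: it uses the JDK's `new String(bytes, StandardCharsets.UTF_8)` as a stand-in for the byte-to-string conversion, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTrip {
    // Decode bytes as UTF-8 text, then encode that text back to UTF-8 bytes.
    static byte[] roundTrip(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Binary bytes taken from the hex dump: e1 0f 8e bf 8d
        byte[] binary = {(byte) 0xe1, 0x0f, (byte) 0x8e, (byte) 0xbf, (byte) 0x8d};
        // Plain ASCII bytes from the same dump: 50 4b ("PK")
        byte[] ascii = {0x50, 0x4b};

        // Invalid UTF-8 is replaced during decoding, so the original bytes are lost.
        System.out.println(Arrays.equals(binary, roundTrip(binary))); // false
        // Pure ASCII is valid UTF-8 and comes back unchanged.
        System.out.println(Arrays.equals(ascii, roundTrip(ascii)));   // true
    }
}
```

Only the ASCII portion of the header survives, which is consistent with a few readable fragments appearing among otherwise garbled characters.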

The way to deal with this file is to "reconstitute" the .doc file from Base-64 form to "binary", write that data as a "binary" file, and then use software that understands how to read its header and take it apart sensibly. This would be either Word itself or some API written specifically to access the innards of Word files.
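A minimal sketch of that approach follows, assuming Java 8 or later. It uses `java.util.Base64` rather than the Commons Codec class from the question (where `Base64.decodeBase64(encoded)` already returns the `byte[]` you need). The class name, method names and the sample string are placeholders, not part of the original program.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

class SaveDecodedDocument {
    // Decode Base64 text to raw bytes. The MIME decoder skips line breaks,
    // so stripping \r and \n beforehand is not needed with this decoder.
    static byte[] decode(String encoded) {
        return Base64.getMimeDecoder().decode(encoded);
    }

    // Write the decoded bytes unchanged; no charset conversion is involved.
    static void save(String encoded, Path target) throws IOException {
        Files.write(target, decode(encoded));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the API response (a real response would be much longer).
        String encoded = "UEsDBBQA\r\nBgAIAAAA";
        Path target = Files.createTempFile("decoded", ".doc");
        save(encoded, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

The file written this way should then be opened in Word or read with a Word-aware library, not in a text editor.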

Thanks for the answer. I'm now writing out the byte[] instead of trying the UTF-8 way. Thanks again and sorry about before...we all know how frustrating programming can be at times – heinst, Aug 01 '14 at 01:28
@heinst - It's important to learn how to ask the right questions. — Hot Licks, Aug 01 '14 at 01:30
I thought I explained it clearly, or what I thought was clear. Sorry — heinst, Aug 01 '14 at 01:33
@heinst - There were several things wrong with your question. For instance, the first data you presented was not adequately explained. I still can't tell if it's supposedly the Base-64 data, your UTF8 data, or something else. You jumped around, based on your (false) assumptions, rather than laying out the situation in a logical sequence so others could understand it. — Hot Licks, Aug 01 '14 at 01:45
