MimeKit Character Encoding/Decoding Issue

Question

While using MimeKit to convert .eml files to .msg files, I'm running into an issue that appears to be related to encoding.

With an EML file containing the following, for instance:

--__NEXTPART_20160610_5EF5CF91_471687D
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit

添付ファイル名テスト

The result is garbage in the body content:

・Y・t・t・@・C・・・ｼ・e・X・g

Additionally, base-64 encoded ü characters are showing up as ?? when the EML file is read. I've downloaded the latest release of MimeKit, but it doesn't seem to make a difference.

The .eml files open properly in Outlook 2016, but MimeKit does not appear to read and decode them correctly.

The edit is very... nitpicky? I don't mind, but if we're going to nitpick can we at least make the nitpicking consistent? In other words, MimeKit was edited to `MimeKit` once, but another instance was left in the original font. Also, .eml was nitpicked to `.eml` in one instance, but not in the subsequent instance. Thanks — Michael Coles, May 19 '17 at 20:33

jstedfast · Accepted Answer · 2017-03-27T19:11:09.543

There are a few problems with your above MIME snippet :(

Content-Transfer-Encoding: 7bit is obviously not true (the body contains 8-bit data), although that's not likely to be the problem (MimeKit ignores values of 7bit and 8bit for this very reason).

More importantly, the charset parameter is iso-2022-jp but the content itself is very clearly not iso-2022-jp (it looks like utf-8).

When you get the TextPart.Text value, MimeKit gets that string by converting the raw stream content using the charset specified in the Content-Type header. If that is wrong, then the Text property will also have the wrong value.

The good news is that TextPart has GetText methods that allow you to specify a charset override.

I would recommend trying:

var text = part.GetText (Encoding.UTF8);

See if that works.
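To make the mismatch concrete, here is a minimal sketch that uses only System.Text rather than MimeKit itself. It assumes the bytes in the file really are UTF-8, as suspected above, and the class and method names (CharsetMismatchDemo, Decode) are made up for illustration. Decoding with the declared charset stands in for what TextPart.Text does, and decoding with UTF-8 stands in for the GetText override.

```csharp
using System;
using System.Text;

static class CharsetMismatchDemo
{
    // Decodes raw bytes with the named charset, much as a MIME parser would
    // when it trusts the charset parameter of the Content-Type header.
    public static string Decode(byte[] raw, string charset)
    {
        // Assumption: on .NET Core / .NET 5+, legacy code pages such as
        // iso-2022-jp are only available once this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(charset).GetString(raw);
    }

    static void Main()
    {
        const string original = "添付ファイル名テスト";

        // Assumed situation: the file holds UTF-8 bytes.
        byte[] raw = Encoding.UTF8.GetBytes(original);

        string trusted = Decode(raw, "iso-2022-jp"); // what the header claims
        string overridden = Decode(raw, "utf-8");    // the charset override

        Console.WriteLine(trusted == original);      // False
        Console.WriteLine(overridden == original);   // True
    }
}
```

The exact garbage produced by the wrong decoder depends on the platform, so the sketch only checks whether the round trip succeeds.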

FWIW, iso-2022-jp is an encoding that forces Japanese characters into a 7-bit ASCII form that looks like complete gibberish. This is what your Japanese text would look like if it was actually in iso-2022-jp:

BE:IU%U%!%$%kL>%F%9%H

That's how I know it's not iso-2022-jp :)
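The 7-bit property is easy to check. In iso-2022-jp, double-byte Japanese text is introduced by the escape sequence ESC $ B and ASCII is restored with ESC ( B, and everything in between is printable ASCII. The sketch below uses only System.Text. The names Iso2022JpDemo, Encode and IsSevenBit are illustrative, and the provider registration is an assumption that applies to .NET Core / .NET 5+.

```csharp
using System;
using System.Linq;
using System.Text;

static class Iso2022JpDemo
{
    // Encodes text as iso-2022-jp.
    public static byte[] Encode(string text)
    {
        // Assumption: needed on .NET Core / .NET 5+ for legacy code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("iso-2022-jp").GetBytes(text);
    }

    // True when no byte has its high bit set.
    public static bool IsSevenBit(byte[] data) => data.All(b => b < 0x80);

    static void Main()
    {
        const string text = "添付ファイル名テスト";

        Console.WriteLine(IsSevenBit(Encode(text)));                  // True
        Console.WriteLine(IsSevenBit(Encoding.UTF8.GetBytes(text)));  // False
    }
}
```

A body that contains bytes above 0x7F, like the one in the question, therefore cannot be valid iso-2022-jp.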

Update:

Ultimately, the solution will probably be something like this:

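// requires: using System; using System.Collections.Generic; using System.Text;
var encodings = new List<Encoding> ();
string text = null;

// first choice: the charset declared in the Content-Type header,
// configured to throw instead of silently substituting characters
try {
    var encoding = Encoding.GetEncoding (part.ContentType.Charset,
        new EncoderExceptionFallback (),
        new DecoderExceptionFallback ());
    encodings.Add (encoding);
} catch (ArgumentException) {
    // the charset name is missing or not recognized
} catch (NotSupportedException) {
    // the charset is not supported on this platform
}

// add utf-8 as our first fallback
encodings.Add (Encoding.GetEncoding (65001,
    new EncoderExceptionFallback (),
    new DecoderExceptionFallback ()));

// add iso-8859-1 as our final fallback (it maps every byte value)
encodings.Add (Encoding.GetEncoding (28591,
    new EncoderExceptionFallback (),
    new DecoderExceptionFallback ()));

// use the first encoding that decodes the content without errors
for (int i = 0; i < encodings.Count; i++) {
    try {
        text = part.GetText (encodings[i]);
        break;
    } catch (DecoderFallbackException) {
        // this means that the content did not convert cleanly
    }
}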
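The fallback chain relies on decoders that throw rather than substitute a replacement character. Here is a minimal sketch of that mechanism on raw bytes, without MimeKit; StrictDecodeDemo, Strict and Decode are hypothetical names. It only covers the UTF-8 then ISO-8859-1 part of the chain. One caveat to keep in mind: the approach only helps if the decoder for the declared charset actually rejects the bytes, and whether a given legacy decoder does so is something to test rather than assume.

```csharp
using System;
using System.Text;

static class StrictDecodeDemo
{
    // Builds an encoding that throws instead of substituting '?' or U+FFFD.
    static Encoding Strict(int codePage) =>
        Encoding.GetEncoding(codePage,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

    // Tries strict UTF-8 first, then ISO-8859-1.
    public static string Decode(byte[] raw)
    {
        foreach (var encoding in new[] { Strict(65001), Strict(28591) })
        {
            try
            {
                return encoding.GetString(raw);
            }
            catch (DecoderFallbackException)
            {
                // The bytes are not valid in this encoding; try the next one.
            }
        }
        return null; // not expected: ISO-8859-1 maps every byte value
    }

    static void Main()
    {
        byte[] latin1 = { 0x47, 0x72, 0xFC, 0xDF, 0x65 }; // "Grüße" in ISO-8859-1
        byte[] utf8 = Encoding.UTF8.GetBytes("Grüße");

        Console.WriteLine(Decode(latin1) == "Grüße"); // True (UTF-8 rejects 0xFC)
        Console.WriteLine(Decode(utf8) == "Grüße");   // True (valid UTF-8)
    }
}
```

Putting UTF-8 ahead of ISO-8859-1 matters because ISO-8859-1 accepts any byte sequence, so it must come last.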

Thank you. The .eml file was created by a third party program, so I'll follow up with them; sounds like an issue with their app. — Michael Coles, Mar 27 '17 at 18:54
FWIW, I just updated my answer with a possible generic solution to your problem. — jstedfast, Mar 27 '17 at 19:11
