How to decode this attachment filename with Python?

Question

I have already decoded a lot of email attachment filenames in my code.

But this particular filename breaks my code.

Here is a minimal example:

from email.header import decode_header
encoded_filename='=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
decoded_header=decode_header(encoded_filename) # --> [('SalesInvoiceQ1|\x04\xb5I\x95\xc1\xbd\xc9\xd0\xb9\xc1\x91\x98', 'utf-8')]
filename=str(decoded_header[0][0]).decode(decoded_header[0][1])

Exception:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 16: invalid start byte

Don't ask me how, but Thunderbird is able to decode this filename to: SalesInvoice-Report.pdf

How can I decode this with Python, the way email clients apparently can?

Martijn Pieters · Accepted Answer · 2015-03-16T16:20:34.943

3

There are two Encoded-Word sections in that header. You'd have to detect where one ends and the next begins:

>>> print decode_header(encoded_filename[:28])[0]
('SalesInvoice', 'utf-8')
>>> print decode_header(encoded_filename[28:])[0]
('-Report.pdf', 'utf-8')

Apparently that's what Thunderbird does in this case: it splits the string into =?charset?encoding?data?= chunks. Normally these should be separated by whitespace, typically a folded line break (\r\n, CARRIAGE RETURN + LINE FEED, followed by a space or tab), but in your case they are mashed together. If you re-introduce the \r\n separator the value decodes correctly:

>>> decode_header(encoded_filename[:28] + '\r\n' + encoded_filename[28:])[0]
('SalesInvoice-Report.pdf', 'utf-8')

You could use a regular expression to extract the parts and re-introduce the separator:

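import re
from email.header import decode_header

# Matches a single encoded-word: =?charset?encoding?data?=
# (the encoding letter may be upper or lower case)
quopri_entry = re.compile(r'=\?[\w-]+\?[QBqb]\?[^?]+?\?=')

def decode_multiple(encoded, _pattern=quopri_entry):
    # Re-join the encoded-words with the separator decode_header expects
    fixed = '\r\n'.join(_pattern.findall(encoded))
    # Decode each (bytes, charset) pair; assumes every part has a charset
    output = [b.decode(c) for b, c in decode_header(fixed)]
    return ''.join(output)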

Demo:

>>> encoded_filename = '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
>>> decode_multiple(encoded_filename)
u'SalesInvoice-Report.pdf'
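The code in this answer is written for Python 2 (print statements, u'' output). A minimal sketch of the same idea for Python 3 might look like the following. The names ENCODED_WORD and decode_multiple_py3 are illustrative and not part of the original answer, and the sketch assumes decode_header may hand back either bytes (with a charset) or plain str parts:

```python
import re
from email.header import decode_header

# Matches a single RFC 2047 encoded-word: =?charset?encoding?data?=
ENCODED_WORD = re.compile(r'=\?[\w-]+\?[QBqb]\?[^?]+\?=')

def decode_multiple_py3(encoded, pattern=ENCODED_WORD):
    """Decode a header value whose encoded-words were glued together."""
    words = pattern.findall(encoded)
    if not words:
        # Nothing encoded; hand the value back untouched.
        return encoded
    # Re-introduce a folded line break between the encoded-words.
    fixed = '\r\n '.join(words)
    pieces = []
    for data, charset in decode_header(fixed):
        if isinstance(data, bytes):
            # Assumption: fall back to ASCII when no charset is reported.
            data = data.decode(charset or 'ascii')
        pieces.append(data)
    return ''.join(pieces)

print(decode_multiple_py3(
    '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='))
```

Like the regular-expression approach above, this keeps only the encoded-words and drops any plain text that sits between them.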

Of course, it could be that you have a bug in how you read the header in the first place. Make sure you don't accidentally destroy an existing \r\n separator when extracting the encoded_filename value.
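To find out where the separator goes missing, it can help to test the raw value at each stage of your own extraction code. The following is only a sketch: has_glued_encoded_words is a hypothetical helper, and its pattern assumes the same encoded-word shape as the regular expression above.

```python
import re

# One complete encoded-word immediately followed by the start of another,
# with no whitespace in between.
GLUED = re.compile(
    r'=\?[\w-]+\?[QBqb]\?[^?]+\?='   # a full encoded-word
    r'(?==\?[\w-]+\?[QBqb]\?)'       # directly followed by the next one
)

def has_glued_encoded_words(raw_value):
    """Return True if two encoded-words touch without a separator."""
    return GLUED.search(raw_value) is not None

glued = '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
folded = '=?UTF-8?B?U2FsZXNJbnZvaWNl?=\r\n =?UTF-8?B?LVJlcG9ydC5wZGY=?='
print(has_glued_encoded_words(glued))   # True
print(has_glued_encoded_words(folded))  # False
```

If the check is already true for the untouched message source, the sending client or a relay in between produced the malformed header. If it only becomes true after your own parsing step, that step is where the separator is lost.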

edited Mar 16 '15 at 16:20

answered Mar 16 '15 at 15:11


I gave you a bounty for "Reward existing answer". Thank you very much! ... needs 23 hours ... strange restriction of Stackoverflow. – guettli Mar 18 '15 at 10:40
@guettli: thank you! You didn't really have to, but it is *much* appreciated. The restrictions are in place to combat fraud. If you could award immediately, the community would not get a chance to audit bounties used to transfer rep for nefarious purposes. And look at it this way: the extra attention over the bounty period *can* result in extra upvotes, and perhaps someone knows an even better solution to the problem. :-) – Martijn Pieters Mar 18 '15 at 10:50
BTW, we already parsed several thousand file names in mail headers. I guess the client creates the broken encoding. – guettli Mar 18 '15 at 10:54
@guettli: yes, if a client is producing those headers *without* a CRLF in between then that's a broken client. It could also be that it produces the value with a line separator (correct or somehow malformed) and that a MTA in between then 'repairs' the value, etc. Either way, you now have to deal with the fall-out. – Martijn Pieters Mar 18 '15 at 10:59
